
R Help
?function.name                                      Help page on a specific function
args(function.name)                                 Arguments for a particular function
function.name                                       Code of a function
example(function.name)                              Example(s) of a function in action
??function.name                                     For when ?function.name doesn't work
help(package = "package.name")                      Information on a package
RSiteSearch("key phrase")                           Search the R site for key phrases from within [R]
RSiteSearch("{key phrase}")                         Search the R site for exact phrases from within [R]
apropos("key phrase")                               Returns a list of all matching objects in the search list
find("function")                                    Returns the package the function is in
news()                                              Find out new things happening in [R]
news(Version == "v.9", package = "package.name")    Find out new things happening with a package
news(grepl("key phrase", Text), db = news())        Search for key words in news()
help.start()                                        Opens the HTML help pages in a browser
library(sos); findFn("key phrase")                  Search for rated functions related to a topic
  or ???key.phrase; ???'word1 word2'
sessionInfo()                                       Info about the current session (including loaded packages)

Additional Websites
rseek.org                            [R] version of Google
stackoverflow.com                    Q & A forum generally oriented to code and programming
stats.stackexchange.com              Q & A forum generally oriented to statistics
crantastic.org                       Search through packages for key words; use CTRL + F to search by anything, including author
cran.r-project.org/web/views         Search through packages by area (psychometrics, cluster, etc.)
inside-r.org/                        R community site
zoonek2.free.fr/UNIX/48_R/all.html   Helpful book/manual website (very thorough)
statmethods.net/                     Helpful book/manual website (very detailed)

Sample [R] capabilities
demo(persp); demo(graphics); demo(Hershey); demo(plotmath)

Cool R Visualization Examples
http://paulbutler.org/archives/visualizing-facebook-friends/
http://blog.revolutionanalytics.com/2012/01/nyt-uses-r-to-map-the-1.html
http://blog.revolutionanalytics.com/2009/11/choropleth-challenge-result.html
http://www.r-bloggers.com/visualize-your-facebook-friends-network-with-r/
http://www.r-bloggers.com/see-the-wind/
http://www.r-bloggers.com/mapped-british-and-spanish-shipping-1750-1800/


Packages & Libraries

Working with packages
library(package.name)                          Loads a package
require(package.name)                          Loads a package (returns FALSE instead of an error if missing)
.libPaths()                                    Prints the path(s) to R's library(s)
detach(package:package.name, unload = TRUE)    Removes a package from the search path
install.packages("package.name")               Install a package from the command line
library(fBasics); listFunctions("stats")       List functions from a package
objects(package:package.name)                  List functions from a package (preferred: from base)
installed.packages()[, 1]                      What packages do you have installed on your computer
maintainer("package.name")                     Get the name and email of the package maintainer
packageDescription("package.name")             Brief info about a package's contents
packageDescription("package.name")["Version"]  Find the version number of a package
remove.packages("package.name")                Delete a package from your library
data(package = "package.name")                 Look at data sets available in a package
library()                                      Look at all available libraries
vignette()                                     Look at all available vignettes for installed libraries
vignette("package.name")                       Look at vignettes for a package
(.packages())                                  Current packages loaded
search()                                       Current packages loaded
library(help = "package.name")                 See contents of a package
package.name::object                           Access a package object w/o attaching the package
                                               (especially good if 2 packages have the same named function)
list.files(.libPaths()[1])                     See what files are in your saved library
list.files(.libPaths())                        Shows all available packages, including standard install libs
.packages(all = TRUE)                          Shows all available packages
(.libPaths())                                  Displays the paths to all your library locations
suppressPackageStartupMessages(library(package.name))   Suppress the startup message of a package

Install Package Not Compiled for Windows

install.packages("PATH/TO/THE/SVGAnnotation.tar.gz", repos = NULL, type = "source")

Citing [R] and packages
citation()                                                   #citing [R]
citation(package = "psych", lib.loc = NULL, auto = NULL)     #bibtex citation of a package, method 1
utils:::print.bibentry(citation("psych"), style = "Bibtex")  #bibtex citation of a package, method 2

Importing data from Excel (csv), text, HTML
Make sure the Excel file is saved as a .csv file in the folder containing the working directory of R.

<-read.csv(file, header=TRUE, strip.white = TRUE, sep=",", as.is=FALSE,
    na.strings= c("999", "NA", " "))
<-read.delim(file, header=T, strip.white = T, sep="\t", as.is=FALSE,
    na.strings= c("999", "NA", " "))
library(XML)   #'which' is the table number to return
<-readHTMLTable(doc, which=#, header=T, strip.white = T, as.is=FALSE, sep=",",
    na.strings= c("999", "NA", " "))
<-read.fwf(file, widths, header=FALSE, strip.white = T, sep=" ", as.is=FALSE,
    na.strings= c("999", "NA", " "))
library(gdata)
<-read.xls(file, sheet=1, header=FALSE, strip.white=T, sep=" ", as.is=FALSE,
    na.strings= c("999", "NA", " "))
require(foreign)   #for SPSS
<-read.spss(file, use.value.labels = TRUE, to.data.frame = TRUE)

Use as.is=TRUE to keep character columns as character rather than converting them to factor.
#example 1
library(XML)
URL <- "http://library.columbia.edu/indiv/dssc/data/nycounty_fips.html"
Table <- readHTMLTable(URL, colClasses = rep("character", 2), skip.rows=1, which=1)
names(Table) <- c("County_FIPS", "County_Name")
Table

#example 2
library(XML)
URL2 <- "http://en.wikipedia.org/wiki/List_of_counties_in_New_York"
Table2 <- readHTMLTable(URL2, which=2)
Table2   #needs to be cleaned

Using File Choose
<-read.table(file.choose(), sep=",", header=T, strip.white = T,
    na.strings=c("999","NA"," "))

For Fixed Column Width Files: Save as plain text and import into Excel using
File/Open, following the steps. Then export as a .csv file. [or use read.fwf()]
<-read.csv(file.choose(), strip.white = TRUE, header=TRUE, sep=",", na.strings="999")

Exporting a data table to Excel
write.table(x, file = "foo.csv", sep = ",", col.names = T, row.names=F, qmethod = "double")
write.table(x, file = "foo.csv", sep = ",", col.names = NA, qmethod = "double")

Exporting a data table to SAS
library(SASxport); write.xport(...dataframe(s)..., file=)

Keyboard Shortcuts
Clear console                                        Ctrl + L
Load script lines                                    Ctrl + R
Load all script                                      Ctrl + A and then Ctrl + R
Load content to console from non-interactive
  window (i.e., history() etc.)                      Ctrl + C; Ctrl + V
Go to the beginning or end of script                 Ctrl + HOME; Ctrl + END
Highlight from a given point to beginning or end     Ctrl + SHIFT + HOME; Ctrl + SHIFT + END
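The export/import calls above can be checked with a quick round trip. This is a minimal sketch (the data frame and file are throwaways; tempfile() is used so the working directory is untouched):

```r
# Hypothetical round trip: write a data frame to CSV, read it back.
df <- data.frame(id = 1:3, score = c(2.5, 3.1, 4.0))
f <- tempfile(fileext = ".csv")                     # throwaway file path
write.table(df, file = f, sep = ",", col.names = TRUE,
            row.names = FALSE, qmethod = "double")
df2 <- read.csv(f, header = TRUE, strip.white = TRUE,
                na.strings = c("999", "NA", " "))
identical(dim(df), dim(df2))                        # same shape back
unlink(f)                                           # clean up
```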

Write and Read in Vector Files of Unequal Length
write.unequal(, csv.name)
read.unequal(file)
#=============WRITE=A=CSV=FILE=OF=UNEQUAL=LENGTHS================
Vector1 <- 1:6
Vector2 <- LETTERS[1:9]
Vector3 <- c('the', 'quick', 'red', 'fox', 'jumped', 'over', 'the',
    'lazy', 'brown', 'dog')
Vector4 <- c(.1, .3, .6, .4)

lst <- list(Vector1, Vector2, Vector3, Vector4)
lns <- sapply(lst, length)
n <- length(lst)
ans <- as.data.frame(matrix(nrow = max(lns), ncol = n))
for(i in 1:n){
    ans[1:lns[i], i] <- lst[[i]]
}
ans
write.csv(ans, file = "DELETE.ME.csv", na = "", row.names = FALSE)
#=============READ=IN=A=FILE=OF=UNEQUAL=LENGTHS==================
x <- "DELETE.ME.csv"   #THE CSV OF != LENGTH VECTORS
j <- read.csv(x, stringsAsFactors = F)
k <- lapply(as.list(j), function(x){x[!is.na(x)]})
#======================DELETE=THE=FILE===========================
unlink(x)   #delete() in the original is a personal helper; unlink() is base R
###################################################################
# NOTE: I WRAPPED THIS ALL UP IN A FUNCTION I KEEP IN THE USEFUL  #
# FUNCTIONS FILE LOADED BY .FIRST AS SEEN BELOW                   #
###################################################################
write.unequal(Vector1, Vector2, Vector3, Vector4, csv.name=".DELETE.ME")
read.unequal(".DELETE.ME.csv")


Read in ascii type files (see my created function below)
read.table(name<-textConnection("")); close(name)
site.data <- read.table(tc <- textConnection(
"site year peak
1  ALBEN  5 101529.6
2  ALBEN 10 117483.4
3  ALBEN 20 132960.9
8  ALDER  5   6561.3
9  ALDER 10   7897.1
10 ALDER 20   9208.1
15 AMERI  5  43656.5
16 AMERI 10  51475.3
17 AMERI 20  58854.4")); close(tc)
site.data
site.data[,1]

#created read-in function
ascii <- function(x, header=TRUE){
    name <- textConnection(x)
    DF <- read.table(name, header)
    close(name)
    DF
}

ascii("site year peak
1  ALBEN  5 101529.6
2  ALBEN 10 117483.4
3  ALBEN 20 132960.9
8  ALDER  5   6561.3
9  ALDER 10   7897.1
10 ALDER 20   9208.1
15 AMERI  5  43656.5
16 AMERI 10  51475.3
17 AMERI 20  58854.4")

Alternative Method

Read in ascii type files read.table(text="", header=TRUE)


read.table(text="site year peak
1  ALBEN  5 101529.6
2  ALBEN 10 117483.4
3  ALBEN 20 132960.9
8  ALDER  5   6561.3
9  ALDER 10   7897.1
10 ALDER 20   9208.1
15 AMERI  5  43656.5
16 AMERI 10  51475.3
17 AMERI 20  58854.4", header=TRUE)

Give comments to an object
comment(object)            #view an object's comment
comment(object) <- value   #assign a comment

#EXAMPLE:
x <- mtcars
comment(x) <- c("This is about cars #0234", "I know nothing about cars")
x
comment(x)
str(x)

Object Characteristics
str()
names()
attributes()
comment()
getAnywhere()   #look at any code example
getAnywhere(plot.table)

mod <- lm(disp ~ hp + cyl, mtcars)
str(mod)
attributes(mod)
names(mod)

str(mtcars)
attributes(mtcars)
names(mtcars)

#WATCH OUT FOR METHODS CLASSES
library(tm); library(proxy)
dissimilarity   #notice the UseMethod (that tells you to look at the methods)
methods(dissimilarity)   #notice there are three different method types
getAnywhere("dissimilarity.DocumentTermMatrix")   #works
tm:::dissimilarity.DocumentTermMatrix             #so does :::


Merging data sets (by column)
merge(x, y, by = intersect(names(x), names(y)),
    by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
    sort = TRUE, suffixes = c(".x",".y"))

x, y            data frames, or objects to be coerced to one.
by, by.x, by.y  specifications of the common columns. See Details.
all             logical; all = L is shorthand for all.x = L and all.y = L.
all.x           logical; if TRUE, then extra rows will be added to the output, one
                for each row in x that has no matching row in y. These rows will
                have NAs in those columns that are usually filled with values from
                y. The default is FALSE, so that only rows with data from both x
                and y are included in the output.
all.y           logical; analogous to all.x above.
sort            logical. Should the results be sorted on the by columns?

Merging data sets (by row) and fill missing with NA
library(plyr)
rbind.fill(dataframes)
#EXAMPLE
x1 <- LETTERS[1:3]
x2 <- letters[1:3]
x2b <- letters[5:7]
x3 <- rnorm(3)
x4 <- rnorm(3)
x5 <- rnorm(3)

#DATA LOOKS LIKE THIS
data.frame(x1,x2,x3,x4,x5)
data.frame(x1,x3,x4,x5)
data.frame(x2,x3,x4,x5)
data.frame(x1,x2,x3,x4,x5)
data.frame(x1,x2b,x3,x4,x5)

#========================================
# IF EACH ONE IS A DATA FRAME ALREADY
#========================================
library(plyr)
d1 <- data.frame(x1,x2,x3,x4,x5)
d2 <- data.frame(x1,x3,x4,x5)
d3 <- data.frame(x2,x3,x4,x5)
d4 <- data.frame(x1,x2,x3,x4,x5)
d5 <- data.frame(x1,x2b,x3,x4,x5)
rbind.fill(d1,d2,d3,d4,d5)

#========================================
# IF THE DATA FRAMES ARE STORED IN A LIST
#========================================
library(plyr)
LIST <- list(
    data.frame(x1, x2, x3, x4, x5),
    data.frame(x1,x3,x4,x5),
    data.frame(x2,x3,x4,x5),
    data.frame(x1,x2,x3,x4,x5),
    data.frame(x1,x2b,x3,x4,x5))
DF <- rbind.fill(LIST)
data.frame(FAC(DF), NUM(DF))   #FAC() and NUM() are personal helper functions
#Output
     x1   x2  x2b         x3         x4         x5
1     A    a <NA> -1.0193006 -0.8175212 -0.3094028
2     B    b <NA> -2.0372846 -1.0685405 -1.0913312
3     C    c <NA> -0.6502925  0.7338066  0.7393544
4     A <NA> <NA> -1.0193006 -0.8175212 -0.3094028
5     B <NA> <NA> -2.0372846 -1.0685405 -1.0913312
6     C <NA> <NA> -0.6502925  0.7338066  0.7393544
7  <NA>    a <NA> -1.0193006 -0.8175212 -0.3094028
8  <NA>    b <NA> -2.0372846 -1.0685405 -1.0913312
9  <NA>    c <NA> -0.6502925  0.7338066  0.7393544
10    A    a <NA> -1.0193006 -0.8175212 -0.3094028
11    B    b <NA> -2.0372846 -1.0685405 -1.0913312
12    C    c <NA> -0.6502925  0.7338066  0.7393544
13    A <NA>    e -1.0193006 -0.8175212 -0.3094028
14    B <NA>    f -2.0372846 -1.0685405 -1.0913312
15    C <NA>    g -0.6502925  0.7338066  0.7393544


Merge Rows of a Data Set - Method 1 [sum by id variables]
library(plyr)
ddply(data.frame, .(other.facs), summarize, combined.fac = sum(combined.fac))

    dataframe     = data frame
    combined.facs = numeric factors you want to sum (or other operation)
    other.facs    = list of factors that are repeated in all rows for the combined.facs

Merge Rows of a Data Set - Method 2 [sum by id variables]
library(data.table)
dataframe[ , list(combined.facs=sum(combined.facs)), list(other.facs)]

    dataframe     = data frame (as a data.table)
    combined.facs = numeric factor you want to sum (or other operation)
    other.facs    = list of factors that are repeated in all rows for the combined.facs
EXAMPLE
(dat <- structure(list(year = structure(c(1L, 1L, 1L, 1L, 1L, 1L),
    .Label = "base", class = "factor"), age = structure(c(1L, 2L, 2L, 3L,
    3L, 4L), .Label = c("0", "1", "2", "3"), class = "factor"),
    pop = c(98378, 104648, 96769, 92448, 100745, 116926),
    FIPS = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "6001",
    class = "factor")), .Names = c("year", "age", "pop", "FIPS"),
    row.names = c(NA, -6L), class = c("data.table", "data.frame"),
    sorted = c("year", "age", "FIPS")))

library(plyr)         #Method 1
ddply(dat, .(year, age, FIPS), summarize, pop = sum(pop))

library(data.table)   #Method 2
dat[, list(pop=sum(pop)), list(year, age, FIPS)]

dat <- data.frame(dat, "x2" = sample(1:100, nrow(dat)))
dat$x2 <- as.numeric(dat$x2)
ddply(dat, .(year, age, FIPS), summarize, pop = sum(pop), x2 = sum(x2))

Extra Combined factors

Paste two data frames together


a <- mtcars[1:3, 1:3]
b <- mtcars[1:3, 8:10]
mypaste <- function(x, y) paste(x, "(", y, ")", sep="")
mapply(mypaste, a, b)


Multi Merge
#Create three data frames
Week_1_sheet <- read.table(text="ID Gender DOB Absences Unexcused_Absences Lates
1 1 M 1997 5 1 14
2 2 F 1998 4 2 3", header=TRUE)

Week_2_sheet <- read.table(text="ID Gender DOB Absences Unexcused_Absences Lates
1 1 M 1997 2 1 10
2 2 F 1998 8 2 2
3 3 M 1998 8 2 2", header=TRUE)

Week_3_sheet <- read.table(text="ID Gender DOB Absences Unexcused_Absences Lates
1 1 M 1997 2 1 10
2 2 F 1998 8 2 2", header=TRUE)

#Consolidate them into a list
WEEKlist <- list(Week_1_sheet, Week_2_sheet, Week_3_sheet)

#Transform common variables before the merge
lapply(seq_along(WEEKlist), function(x) {
    WEEKlist[[x]] <<- transform(WEEKlist[[x]],
        Absences = sum(Absences, Unexcused_Absences))[, -5]
})   #notice the assignment to the environment

#Change names of columns that may overlap with the other data frames
#yet not have duplicate data
lapply(seq_along(WEEKlist), function(x) {
    y <- names(WEEKlist[[x]])   #do this to avoid repeating this 3 times
    names(WEEKlist[[x]]) <<- c(y[1:3], paste(y[4:length(y)], ".", x, sep=""))
})   #notice the assignment to the environment

#Method using a for loop
DF <- WEEKlist[[1]][, 1:3]
for (.df in WEEKlist) {
    DF <- merge(DF, .df, by=c('ID', 'Gender', 'DOB'), all=TRUE,
        suffixes=c("", ""))
}
DF

#Method using Reduce
merge.all <- function(frames, by) {
    Reduce(function(x, y) {merge(x, y, by = by, all = TRUE)}, frames)
}

merge.all(frames=WEEKlist, by=c('ID', 'Gender', 'DOB'))
merge.all(frames=WEEKlist, by=1:3)

#BENCHMARKING
require(rbenchmark)
benchmark(
    LOOP = {DF <- WEEKlist[[1]]
        for (.df in WEEKlist) {
            DF <- merge(DF, .df, by=c('ID', 'Gender', 'DOB'), all=TRUE,
                suffixes=c("", ""))
        }},
    REDUCE = merge.all(frames=WEEKlist, by=c('ID', 'Gender', 'DOB')),
    columns = c("test", "replications", "elapsed", "relative",
        "user.self", "sys.self"),
    order = "test", replications = 1000, environment = parent.frame())

    test replications elapsed relative user.self sys.self
1   LOOP         1000   10.12  1.62701      7.89        0
2 REDUCE         1000    6.22  1.00000      5.34        0


Exporting an output to a file (method 1)
cat(object, file="name.doc", sep = " ", append = FALSE)
Append means add to the existing file (TRUE) or overwrite the file (FALSE).
EXAMPLE
xc <- pi*3^2
cat(xc, file="xREPORT.doc")
xc2 <- ((xc+23)/4)-1000
cat(xc2, file="xREPORT.doc", append=T)
unlink("xREPORT.doc")

Exporting an output to a file (method 2)
write(x, file="data", ncolumns=if(is.character(x)) 1 else 5, append=FALSE)
Append means add to the existing file (TRUE) or overwrite the file (FALSE).
xc <- pi*3^2
cat(xc, file="xREPORT.doc")
xc2 <- ((xc+23)/4)-1000
write(xc2, file="xREPORT.doc", append=T)
unlink("xREPORT.doc")

Exporting an output to a file (method 3)
This method prints all the results directly to the file without naming an object to print.
sink(file="name.doc", append = FALSE, split = FALSE)
Append means add to the existing file (TRUE) or overwrite the file (FALSE).
Split sends output both to the file and to the command line.
see also: capture.output()
EXAMPLE
sink("example.doc", append=FALSE, split=TRUE)   #append=T if file already exists
mod <- lm(mpg~disp*cyl, data=mtcars)
anova(mod)
summary(mod)
cat("The dog ate the food on ", date(), ".\n", sep="")
sink()   #turns sink off

Saving R Objects
Save objects and load the same objects into R:
save(, file = "foo.RData")
load("foo.RData")

Save objects and load them back in by assigning them to a new object:
saveRDS(mod, "mymodel.rds")
<- readRDS("mymodel.rds")
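The difference between the two approaches can be seen in a small sketch: save()/load() restore objects under their original names, while saveRDS()/readRDS() return the object so you can assign it a new name (the model and file names here are throwaways):

```r
# Fit a small model to have something to save
mod <- lm(mpg ~ wt, data = mtcars)
f1 <- tempfile(fileext = ".RData")
f2 <- tempfile(fileext = ".rds")

save(mod, file = f1)            # stores 'mod' under its own name
rm(mod)
load(f1)                        # 'mod' reappears under its old name

saveRDS(mod, f2)                # stores the object itself
mod2 <- readRDS(f2)             # read back under any name you like
identical(coef(mod), coef(mod2))
unlink(c(f1, f2))               # clean up
```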

Delete a File from within [R] unlink("file") or file.remove("file")

EXAMPLE STRING<-"TEST" cat(STRING, file = ".TEST.txt") unlink(".TEST.txt")

Checking What Files Are in a directory (default is working directory) list.files()


Checking the Data Set
Simply type the name of the data set (data frame) and hit enter (in the example
above we called the data set myData).

Look @ beginning or end of a data set
head()
tail()
dataset[1:n] or dataset[c(3,4,5,6,100,101,102,200),]

The psych package also includes a quick way to show the first and last n lines
of a data.frame, matrix, or a text object:
headtail(x, hlength=4, tlength=4, digits=2)
Arguments
    x        A matrix or data frame or free text
    hlength  The number of lines at the beginning to show
    tlength  The number of lines at the end to show
    digits   Round off the data to digits

Quick way to attach/detach variables
attach(myData)/detach(myData)
where attach is the command and myData is the imported file (data set). Now
when you type the column headers you are expressing the variable name.

Preferred method for attaching data to a function/expression
with(data, expr, ...)
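A minimal sketch of the preferred with() style, using the built-in mtcars data instead of attach():

```r
# with() evaluates the expression inside the data frame, so column names
# can be used directly without attach()/detach()
with(mtcars, mean(mpg))              # same as mean(mtcars$mpg)
with(mtcars, tapply(mpg, cyl, mean)) # group means without attaching
```

Unlike attach(), nothing is left on the search path afterwards, so there is no risk of stale copies of the columns shadowing later changes.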

Looking at a Variable (column) from the Data Set (data frame) Type: myData$Day Note: the myData and the Day portion are alterable; myData is your data set (data frame) and the Day portion is the name of the variable in the data set. This shows the vector for that variable.


Manual data Entry
x <- c(3,4,3,5,2,3,4,3)
or
x <- scan()
This is a line-by-line entry system that has the feel of Excel entry. When you
come to the end of your data press enter.

> x <- scan()
1: 21
2: 2
3: 3
4: 4
5: 5
6: 67
7: 776
8: 565
9: 45
10: 87
11: 567
12: 54
13: 34
14: 34
15: 32
16:
Read 15 items
> mean(x)
[1] 153.0667

Look at stored variables/objects
ls()          #see all objects in environment in R console
objects()     #see all objects in environment in R console
browseEnv()   #see all objects in environment in web browser
ls.str()      #gives all the stored objects in the workspace plus some info on each one

Remove all stored variables/objects (see also: Reduce Objects and Junk in Memory)
rm(list = ls(all = TRUE))

ls() checks whether the variables have been reset; the output will be character(0).

Remove everything except for functions
rm(list=setdiff(ls(all.names=TRUE), lsf.str(all.names=TRUE)))

Searching the Objects in List


term <- "b"
ls(pattern=paste("^", term, sep=""))   #objects whose names start with "b"
ls(pattern=term)                       #objects whose names contain "b"

Hiding objects in workspace


EXAMPLE
.BB <- "You can't see me!"
ls()
.BB
rm(.BB)   #to remove the object .BB

Name the object beginning with a period and it is hidden from ls() in the workspace.


Checking the Data: Missing Values or Missing Data
Finding Missing Values
Type NAfun() for a list of NA functions I've created.

Good implementations that can be accessed through R include Amelia II, Mice, and mitools.

Functions to omit observations with missing values (listwise deletion)
na.fail(object, ...)
na.omit(object, ...)
na.exclude(object, ...)
na.pass(object, ...)

If na.omit removes cases, the row numbers of the cases form the "na.action"
attribute of the result, of class "omit". na.exclude differs from na.omit only
in the class of the "na.action" attribute of the result, which is "exclude".
This gives different behaviour in functions making use of naresid and
napredict: when na.exclude is used the residuals and predictions are padded to
the correct length by inserting NAs for cases omitted by na.exclude.

Impute means or median for missing (not a preferred method)
library(e1071)
impute(x, what = c("median", "mean"))

EXAMPLE
lk <- c(3,4,5,6,NA,3,4,5,6)
jk <- c(.4,NA,.5,.3,.4,.3,NA,NA,.8)
das <- data.frame(lk,jk)
sapply(na.omit(das), mean)
sapply(na.omit(das), median)
das
impute(das, what = "median")
impute(das, what = "mean")
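The na.omit/na.exclude padding behaviour described above can be seen in a small sketch (toy data; the lm() fit is only there to produce residuals):

```r
# na.omit drops the incomplete case entirely; na.exclude pads residuals
# back to the full data length with NA, which matters when you want to
# line residuals up against the original rows.
df <- data.frame(y = c(1, 2, NA, 4), x = 1:4)
m1 <- lm(y ~ x, data = df, na.action = na.omit)
m2 <- lm(y ~ x, data = df, na.action = na.exclude)
length(residuals(m1))        # 3 - the NA row is gone
length(residuals(m2))        # 4 - padded with NA in position 3
```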

Replace Missing Values With A Given Value
(see the "for selected columns" method below)
variable[is.na(variable)] <- #value you want to impute

EXAMPLE:
mtcars2 <- mtcars
mtcars2$carb <- with(mtcars2, replace(carb, carb==1, NA))
mtcars2
mtcars2$carb[is.na(mtcars2$carb)] <- 1000   #1000 could be 0 or the mode etc.
mtcars2

see also: replace()


Replace Missing Values With A given Value for Selected Columns EXAMPLE
A <- c(NA,5,4,7,3,NA,NA)
B <- c(.1,.4,.5,NA,NA,.3,.2)
C <- c(30,NA,40,40,60,50,70)
DF <- data.frame(A,B,C)
DF2 <- DF   #this is just so we can reset DF

cols <- c(2,3)   #select the columns you want to impute with 0's
DF[,cols][is.na(DF[,cols])] <- 0
DF

cols <- c(2,3)   #select the columns you don't want to impute with 0's
DF2[,-cols][is.na(DF2[,-cols])] <- 0
DF2

Replace Missing Values With Means by Group
impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))   #generic function
dat2 <- ddply(dataframe, ~ group.var, transform,
    new.var{or.replace.old} = impute.mean(var))

dat <- read.table(text = "id taxa length width
101 collembola 2.1 0.9
102 mite 0.9 0.7
103 mite 1.1 0.8
104 collembola NA NA
105 collembola 1.5 0.5
106 mite NA NA", header=TRUE)

library(plyr)
impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
dat2 <- ddply(dat, ~ taxa, transform,
    length = impute.mean(length), width = impute.mean(width))
dat2[order(dat2$id), ]

> dat
   id       taxa length width
1 101 collembola    2.1   0.9
2 102       mite    0.9   0.7
3 103       mite    1.1   0.8
4 104 collembola     NA    NA
5 105 collembola    1.5   0.5
6 106       mite     NA    NA
> dat2[order(dat2$id), ]
   id       taxa length width
1 101 collembola    2.1  0.90
4 102       mite    0.9  0.70
5 103       mite    1.1  0.80
2 104 collembola    1.8  0.70
3 105 collembola    1.5  0.50
6 106       mite    1.0  0.75


Generic Replace Missing Values
impute <- function(x, fun) {
    missing <- is.na(x)
    replace(x, missing, fun(x[!missing]))
}
ddply(dataframe, ~ group, transform, length = impute(length, function))

impute <- function(x, fun) {
    missing <- is.na(x)
    replace(x, missing, fun(x[!missing]))
}
ddply(dat, ~ taxa, transform, length = impute(length, mean),
    width = impute(width, mean))
ddply(dat, ~ taxa, transform, length = impute(length, median),
    width = impute(width, median))
ddply(dat, ~ taxa, transform, length = impute(length, min),
    width = impute(width, min))
con <- function(x) 100
ddply(dat, ~ taxa, transform, length = impute(length, con),
    width = impute(width, con))


Create a subset of data with missing values removed per variable

Example:
A <- c(NA,2:6)
B <- c(11:15,NA)
C <- c(NA,3,NA,5,NA,9)
(DF <- data.frame(A,B,C))
with(DF, which(is.na(B)))
with(DF, which(!is.na(B)))
DF[with(DF, which(is.na(B))), ]
DF[with(DF, which(!is.na(B))), ]

Output:
> A <- c(NA,2:6)
> B <- c(11:15,NA)
> C <- c(NA,3,NA,5,NA,9)
> (DF <- data.frame(A,B,C))
   A  B  C
1 NA 11 NA
2  2 12  3
3  3 13 NA
4  4 14  5
5  5 15 NA
6  6 NA  9
> with(DF, which(is.na(B)))
[1] 6
> with(DF, which(!is.na(B)))
[1] 1 2 3 4 5
> DF[with(DF, which(is.na(B))), ]
  A  B C
6 6 NA 9
> DF[with(DF, which(!is.na(B))), ]
   A  B  C
1 NA 11 NA
2  2 12  3
3  3 13 NA
4  4 14  5
5  5 15 NA

#LOOK AT: library(Hmisc); aregImpute()


Assumption Testing
Function for Assessing Assumptions
library(gvlma)
gvlma(lm(model))


EXAMPLE
library(gvlma)
(gvmodel <- with(mtcars, gvlma(lm(mpg~disp*hp*cyl))))
summary(gvmodel)
plot(gvmodel, onepage=F)

Normality Assumption (remember the assumption is usually normality of residuals)


#===============================================================================
# LOADING THE LIBRARIES USED
#===============================================================================
library(MASS); library(nortest); library(fBasics); library(psych); library(timeDate)
#===============================================================================
# GENERATING SOME DATA
#===============================================================================
x.norm <- rnorm(n=200, m=10, sd=2)
#===============================================================================
# LOOKING AT THE GRAPHS (remember somewhat manipulated by the scale you choose)
#===============================================================================
par(mfrow=c(3,1))
h <- hist(x.norm, main="Histogram of observed data w/ normal curve", col="red")
xfit <- seq(min(x.norm), max(x.norm), length=40)
yfit <- dnorm(xfit, mean=mean(x.norm), sd=sd(x.norm))
yfit <- yfit*diff(h$mids[1:2])*length(x.norm)
lines(xfit, yfit, col="blue", lwd=2)
plot(density(x.norm), main="Density estimate of data")
polygon(density(x.norm), col="green", border="blue")
truehist(x.norm, main="True Histogram of observed data")
#===============================================================================
# LOOKING AT THE QQ PLOTS (very effective approach)
#===============================================================================
win.graph()
par(mfrow=c(1,2))
qqnorm(x.norm, col="red")
qqline(x.norm)
qqnormPlot(x.norm)
#===============================================================================
# STATISTICAL TESTS OF NORMALITY (vary greatly; be cautious; p>.05 = normal)
#===============================================================================
ksnormTest(x.norm)       #Kolmogorov-Smirnov (for large samples) normality test
shapiro.test(x.norm)     #Shapiro-Wilks (for small samples) test for normality
shapiroTest(x.norm)      #Shapiro-Wilks test for normality
jarqueberaTest(x.norm)   #Jarque-Bera test for normality
dagoTest(x.norm)         #D'Agostino normality test
adTest(x.norm)           #Anderson-Darling normality test
cvmTest(x.norm)          #Cramer-von Mises normality test
lillieTest(x.norm)       #Lilliefors (Kolmogorov-Smirnov) test
pchiTest(x.norm)         #Pearson chi-square normality test
sfTest(x.norm)           #Shapiro-Francia normality test
kurtosis(x.norm, type=1) #type 1 biased; type 2 unbiased
kurtosis(x.norm, type=2) #excess selected = moment method or -3 (0 is normal)
kurtosi(x.norm)
library(e1071)
kurtosis(x.norm, type=1); kurtosis(x.norm, type=2); kurtosis(x.norm, type=3)
skewness(x.norm, type=1); skewness(x.norm, type=2); skewness(x.norm, type=3)
skew(x.norm)
win.graph(); par(mfrow=c(1,2))
mardia(x.norm)           #for multivariate data
probplot(x.norm)


Addressing Non-Normality

The Assumptions Script folder has several skew and kurtosis checkers.

One Function to Conduct Multiple Tests of Normality


source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Assumption Testing/Tests of Normality.txt")

Info on Normality from Andy Field

Transforming Skew (for positive skew) 3 Methods: 1. Log Transformation

Graph example of the log transformation

log10(Xi)

mod1 <- lm(mpg~disp, data=mtcars)
mod2 <- lm(log(mpg)~disp, data=mtcars)
windows(h=6, w=12)
par(mfrow=c(1,2))
with(mtcars, plot(mpg~disp, main="Raw Plot"))
abline(reg=mod1, lty=3, col="green")
with(mtcars, plot(log(mpg)~disp, main="Log Transformation"))
abline(reg=mod2, lty=3, col="blue")
mod1; mod2

You can't log 0, so if your data contains 0 you must add a constant to adjust the data.

2. Square Root Transformation
sqrt(Xi)

You can't sqrt() values < 0, so if your data has negative numbers you must add
a constant to adjust the data.

3. Reciprocal Transformation
1/(Xhighest score - Xi)
The data are reverse scored (Xhighest score - Xi) to overcome the effect of the
inverse making the big scores small and the small scores big.
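The three transformations can be sketched on toy data. Note two assumptions here that are not spelled out in the text: a constant of 1 is added before the log because the data contain a 0, and 1 is added inside the reciprocal so the maximum score does not produce a division by zero:

```r
# Positively skewed toy data containing a 0
x <- c(0, 1, 2, 4, 8, 16, 32)

log.x   <- log10(x + 1)           # constant added because log10(0) fails
sqrt.x  <- sqrt(x)                # fine here: no negative values
recip.x <- 1 / (max(x) + 1 - x)   # reciprocal of (reverse scores + 1)

# All three pull in the long right tail while preserving the ordering
data.frame(x, log.x, sqrt.x, recip.x)
```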

All of these transformations can be done to negative skew as well, but the data
must be reverse scored (Xhighest score - Xi) first to reverse the skew.
REMEMBER: Transform one numeric variable, transform all of them.

Fixing & Transforming Kurtosis
1. First check for outliers and, if possible, delete any cases lying too many
   SDs from the regression line.
2. Square Transformation
(Xi)^2

Function for normalizing Data
uniformDAT(x)

Source Path
source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Assumption Testing/Normal Tansformation.txt")

# NORMAL TRANSFORMATION FUNCTION CODE WITH EXAMPLE
#===============================================================
uniformDAT <- function(x) {
    x <- rank(x, na.last = "keep", ties.method = "average")
    n <- sum(!is.na(x))
    x / (n + 1)
}
normalize <- function(x) {
    qnorm(uniformDAT(x))   #was uniformize(x); uniformDAT is the function defined above
}
#===============================================================
# THE DATA
#===============================================================
par(mfrow=c(3,2))
x1 <- sample(1:100, 100, replace=T)
x2 <- ifelse(x1<21, x1+79, x1)
x3 <- ifelse(x1>80, x1-79, x1)
#===============================================================
# WHAT THE FUNCTION DOES GRAPHICALLY (f() is a personal plotting helper)
#===============================================================
f(x1, FUN = normalize, main = "Distribution 1")
f(x2, FUN = normalize, main = "Right Skewed distribution")
f(x3, FUN = normalize, main = "Left Skewed distribution")
#===============================================================
# LOOKING AT THE DATA
#===============================================================
list("x1 data before"=x1, "x1 data after"=uniformDAT(x1),
    "left skewed data before"=x2, "left skewed data after"=uniformDAT(x2),
    "right skewed data before"=x3, "right skewed data after"=uniformDAT(x3))


Homogeneity Assumption
Equal variance of 2 populations
var.test(x, y)


EXAMPLE
#=================================================================================
# TESTING IF TWO SAMPLE VARIANCES ARE EQUAL
#=================================================================================
# GENERATING THE DATA
x1 <- rnorm(1:1000, 100)
y1 <- rnorm(1:1000, 100)
#=================================================================================
# MEAN AND STANDARD DEVIATION
#=================================================================================
descriptives <- data.frame(c(mean(x1), sd(x1)), c(mean(y1), sd(y1)))
colnames(descriptives) <- c("x1", "y1"); rownames(descriptives) <- c("mean", "sd")
descriptives
#=================================================================================
# GRAPH THE DATA
#=================================================================================
par(mfrow=c(2,1)); library(descr)
histkdnc(x1, main="x1"); histkdnc(y1, main="y1")
#=================================================================================
# TESTING EQUAL VARIANCES; function--> var.test()
#=================================================================================
list(" NOTE: p > .05; not significantly different" =
    var.test(x1, y1, alternative="two.sided", conf.level=.95))

Equal variance of groups
bartlett.test(numeric variable, grouping factor)
levene.test(numeric variable, grouping factor)   #library(lawstat)
EXAMPLE
#=================================================================================
# TESTING IF GROUP VARIANCES ARE EQUAL
#=================================================================================
# GENERATING THE DATA
library(doBy)   #recodeVar() comes from doBy
rannum <- c(sample(1:5, 1000, replace=T))
factor <- c(recodeVar(rannum, src=c(1,2,3,4,5),
    tgt=c("blue","black","red","green","orange"), default=NULL, keep.na=TRUE))
dep.var <- rnorm(1000)
color.df <- data.frame(factor, dep.var); tail(color.df)
#=================================================================================
# MEAN AND STANDARD DEVIATION
#=================================================================================
summaryBy(dep.var ~ factor, data = color.df, FUN = function(x) {
    c(n = length(x), mean = mean(x), sd = sd(x)) })
#=================================================================================
# GRAPH THE DATA
#=================================================================================
black <- subset(color.df, factor=="black"); blue <- subset(color.df, factor=="blue")
green <- subset(color.df, factor=="green"); orange <- subset(color.df, factor=="orange")
red <- subset(color.df, factor=="red")
par(mfrow=c(3,2)); library(descr)
histkdnc(dep.var, main="Overall")
histkdnc(black$dep.var, main="black", col="black")
histkdnc(blue$dep.var, main="blue", col="blue")
histkdnc(green$dep.var, main="green", col="green")
histkdnc(orange$dep.var, main="orange", col="orange")
histkdnc(red$dep.var, main="red", col="red")
#=================================================================================
# TESTING EQUAL VARIANCES; functions--> levene.test(), bartlett.test()
#=================================================================================
library(lawstat)
list("Levene's Test" = levene.test(dep.var, factor, location="mean"),
    "Bartlett's Test" = bartlett.test(dep.var, factor))

19 | P a g e

Equal variance of groups (less sensitive to outliers) fligner.test(x, ...) fligner.test(x, g, ...) fligner.test(formula, data, subset, na.action, ...) fligner.test(list(group a, group b, group c, ..., group n)) Arguments x a numeric vector of data values, or a list of numeric data vectors.
g a vector or factor object giving the group for the corresponding elements of x. Ignored if x is a list.
formula a formula of the form lhs ~ rhs where lhs gives the data values and rhs the corresponding groups.
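A minimal sketch of the formula and default interfaces, using the built-in InsectSprays data (insect counts by spray type):

```r
# Fligner-Killeen test of homogeneity of variances
fligner.test(count ~ spray, data = InsectSprays)   # formula interface
with(InsectSprays, fligner.test(count, spray))     # x, g interface
```

Both calls run the same test; the formula interface is usually the more readable choice when the data live in a data frame.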


Sphericity (Assumption of Repeated Measures Test)


Sphericity is, in a nutshell, the assumption that the variances of the differences between the repeated measurements are about the same.

mauchly.test
Greenhouse-Geisser
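A sketch of Mauchly's test along the lines of the example in ?mauchly.test, using simulated reaction times (the data here are random, just to show the calling pattern); for the Greenhouse-Geisser correction itself, Anova() in the car package reports the epsilon adjustments:

```r
set.seed(42)
# 10 subjects measured under 6 repeated conditions (simulated data)
reacttime <- matrix(rnorm(60, mean = 500, sd = 50), ncol = 6)
mlmfit <- lm(reacttime ~ 1)      # multivariate (repeated measures) fit
mauchly.test(mlmfit, X = ~1)     # Mauchly's test of sphericity
```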

Outliers library(mvoutlier) & library(outlier) ?influence.measures


Create Sequences of Integers Input 1:5 Output 1 2 3 4 5 Create Sequences of Real Numbers Input seq(from=3,to=7,by=.5) NOTE: seq(3,7,.5) would also work Output 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 from, to: the starting and (maximal) end value of the sequence. by: increment of the sequence (leaving by out with integer from/to is the same as n:m)

length.out: desired length of the sequence. A non-negative number, which for seq and seq.int will be rounded up if fractional. Repeat Integer Pattern rep(pattern, times=) rep(pattern, times=,each=)
EXAMPLE rep(1:2, times=1,each=25) rep(1:2, times=25) #================================================================= > rep(1:2, times=25) [1] 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 [39] 1 2 1 2 1 2 1 2 1 2 1 2 > rep(1:2, times=1,each=25) [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 [39] 2 2 2 2 2 2 2 2 2 2 2 2

term: replicate

Search Patterns Within Vectors spattern rle(x) inverse.rle(rle(x))


x <- rev(rep(6:10, 1:5)) rle(x) inverse.rle(rle(x)) z <- c(TRUE,TRUE,FALSE,FALSE,TRUE,FALSE,TRUE,TRUE,TRUE) rle(z) inverse.rle(rle(z))


Generating Random Numbers, Integers and Categorical Variables (random sample) Random Normal rnorm(n=, m=, sd=) Where n is the number of samples Random Normal Between Certain Values 1 x<- rnorm(n=500, m=42, sd=10) x <- x[x>=30 & x <=50] Random Normal Between Certain Values 2 (doesn't throw away samples)
n <- 1000 #samples desired L <- .2 #lower limit U <- .8 #upper limit m <- 1 #mean s <- 1 #sd x <- qnorm(runif(n, pnorm(L, mean=m, sd=s), pnorm(U, mean=m, sd=s)), mean=m, sd=s) x

Random Integers sample(seq, n, replace=T) Where seq is a sequence such as 10:80, and n is the number of samples Random Integers 2 sample.int(x, n, replace=T) samples n values from the integers 1 through x

EXAMPLE sample.int(2,10,replace=T)#flip a coin 10 times sample.int(6,7,replace=T)#roll a die 7 times sample.int(52,5,replace=T)#pick a card 5 times

ALLOW REPRODUCIBLE RANDOM NUMBERS NOTE: you can use set.seed() to enable someone else to reproduce exactly the same random numbers
set.seed(15)# allow reproducible random numbers sample(1:2, size=5, replace=TRUE) sample(1:2, size=5, replace=TRUE)

Random Categorical Vector (See generate factor below) sample(categorical.vector,n,replace=T)


colors<-c(sample(c("blue","red","green","orange"),10,replace=T)) hue<-abs(rnorm(10)) colorsDF<-data.frame(colors,hue) #colors is the random creation of a categorical variable colors;hue;colorsDF

Example

Note: the sample() with the recodeVar() from the doBy library could also be used for generating a random character vector. (See also relevel which is more efficient)
library(doBy) recodeVar(sample(1:5,25,replace=T), src=c(1,2,3,4,5), tgt=c("a","e","i","o","u"), default=NULL, keep.na=TRUE) [1] "u" "a" "o" "u" "e" "i" "i" "i" "u" "u" "o" "o" "i" "e" "u" "u" "u" "o" "a" "a" "u" "e" "o" "a" "u"


Generate Factor (non random) gl(n, k, length = n*k, labels = 1:n, ordered = FALSE)
ARGUMENTS

EXAMPLE gl(3, 5, length=100,labels = c("Control", "Treat","Died"))

Convert scores to z-scores scale(vector) The function to the right is an example of how the z-score standardizes the data. The code creates a vector of random integers and then converts it to a z-score vector. There is also a comparison of both vectors with histograms.
SCALEfun<-function(){ ay<-sample(1:100, 25, replace=T) az<-c(scale(ay, center = TRUE, scale = TRUE)) par(mfrow=c(2,1));library(descr) histkdnc(ay,main="Before z score Transformation",col="red") histkdnc(az,main="After z score Transformation",col="green") list(ay,az) }

Probability calculation computes the combinations choose(n,k) EXAMPLE:


choose(54,5) 1/choose(54,5)

Generate all the possible outcomes of a vector given each element is used only once combn(x,m) x vector source for combinations, or integer n for x <- seq_len(n). m number of elements to choose
EXAMPLES combn(letters[1:4], 2) combn(LETTERS[1:10], 9) combn(0:10,10)


Generate All Possible Outcomes For the outside of a matrix outer(vector 1,vector 2,FUN=) EXAMPLES
outer(month.abb, 1999:2003, FUN = "paste")
data.frame(outer(c("R","r"), c("R","r"), FUN = "paste")) #Punnett square
outer(c("H","T"), 1:6, FUN = "paste") #outcomes of flipping a coin and rolling a die
outer(LETTERS[1:10], 0:9, FUN = "paste")
outer(0:9, 0:9, FUN = "*") #multiplication table 0-9
outer(0:20, 0:20, FUN = "*") #multiplication table 0-20
outer(0:9, 1:9, FUN = "/") #division table 0-9
outer(0:9, 0:9, FUN = "^") #exponential table 0-9
outer(0:9, 0:9, FUN = "-") #subtraction table 0-9
outer(0:9, 0:9, FUN = "+") #addition table 0-9

Generate All Possible Combinations for a List of Factors


expand.grid(factor.name1=c("factor levels"), factor.name2=c("factor levels"), factor.namen=c("factor levels"))
EXAMPLE expand.grid(age=c(4:10),academic.level=c("high","med","low"),sex=c("male","female"))

List Prime Numbers library(matlab) primes(n) Perform Prime Factorization library(matlab) factors(n) Create Magic Squares library(matlab) magic(n) Generate a list of Dummy Codes for a Factor model.matrix(~factor-1)
EXAMPLE: (iris.dummy<-with(iris,model.matrix(~Species-1))) (IRIS<-data.frame(iris,iris.dummy))
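A sketch of the matlab-package functions listed above, assuming the matlab package is installed:

```r
library(matlab)  # MATLAB-emulation package, assumed installed
primes(20)   # primes up to 20
factors(84)  # prime factorization of 84
magic(4)     # a 4 x 4 magic square (every row and column sums to 34)
```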


Data Manipulation sround sformat (round can be used to round integers as well) Changing digits options(digits=#) #This is a global change print(x, digits=#) #Local change cat(format(x, digits=#)) #Local change for functions round(x, digits=#) #Rounds x to # decimal places; negative digits round to the left of the decimal point (SEE Cut Points for an example rounding a data frame) signif(x, digits=#) #Rounds to # significant digits EXAMPLE options(digits=10) (x<-pi*12345) round(x,-4:4)
options(scipen=99) #Eliminates scientific notation (Global Change)

format(x, ...) #SEE BELOW FOR MORE ABOUT FORMAT
sprintf("%.49f", (1+sqrt(5))/2) #Force rounding with a certain number of digits
sprintf("%.49f", pi)
library(mpc); mpc(1, 3000) / mpc(998001, 3000)
Format format(x, ...) Arguments
x any R object (conceptually); typically numeric.
trim logical; if FALSE, logical, numeric and complex values are right-justified to a common width; if TRUE the leading blanks for justification are suppressed.
digits how many significant digits are to be used for numeric and complex x. The default, NULL, uses getOption("digits"). This is a suggestion: enough decimal places will be used so that the smallest (in magnitude) number has this many significant digits, and also to satisfy nsmall. (For the interpretation for complex numbers see signif.)
nsmall the minimum number of digits to the right of the decimal point in formatting real/complex numbers in non-scientific formats. Allowed values are 0 <= nsmall <= 20.
justify should a character vector be left-justified (the default), right-justified, centred or left alone.
width default method: the minimum field width or NULL or 0 for no restriction. AsIs method: the maximum field width for non-character objects. NULL corresponds to the default 12.
na.encode logical: should NA strings be encoded? Note this only applies to elements of character vectors, not to numerical or logical NAs, which are always encoded as "NA".
scientific either a logical specifying whether elements of a real or complex vector should be encoded in scientific format, or an integer penalty (see options("scipen")). Missing values correspond to the current default penalty.
... further arguments passed to or from other methods.

Round to the nearest fraction


x <- c(4.2, 4.3, 4.8) #Method 1 (generalizes to any rounding) library(plyr) round_any(x, 3) round_any(x, 1) round_any(x, 0.5) round_any(x, 0.2) #Method 2 (nearest half) round(x*2)/2 #Method 3 (nearest half) round(x/5, 1)*5


Minimal Specifications R looks for objects within the environment you specify that minimally match your requirements: Indexing dataframe$object or dataframe[,"object"] Example
CO2$T #Both Treatment and Type fit this so R returns NULL CO2$Ty CO2$P

Determine number of observations in a dataframe nrow(dataframe)
Determine number of variables in a dataframe ncol(dataframe)
Determine number of levels of a factor nlevels(factor)
Look at beginning or end of a data frame head(dataframe) dataframe[1:10,] I compiled this into the function HEAD() in .First
tail(dataframe) dataframe[(nrow(dataframe)-10):(nrow(dataframe)),] Compiled as function TAIL() in .First
begend(dataframe) Looks at first 5 and last 5 observations of a dataframe
Locating the info for a single row (observation) Type: data[10,] Output
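The commands above on built-in data sets (a minimal sketch):

```r
nrow(mtcars)        # number of observations: 32
ncol(mtcars)        # number of variables: 11
nlevels(CO2$Plant)  # number of levels of the Plant factor: 12
head(mtcars, 3)     # first 3 rows
tail(mtcars, 3)     # last 3 rows
mtcars[10, ]        # the tenth observation
```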

Where data is your data frame (set) and the 10 is the tenth observation. Changing a numeric variable by a constant Type: b<-sa*.45 Output

Where b is the new variable vector name, sa is the original numeric variable, and .45 is the constant. This could be very useful for creating new combined variables:


List all the variables created Type: ls() or objects() Output

Look at all the code and commands you've typed in a session in a new window history(number) Sys.setenv(R_HISTSIZE=10000) #increase the 512 line limit even more Pull up the last value .Last.value A quasi-constant in [R]: not a function, it holds the value of the last evaluated expression EXAMPLE
takes.a.while <- function(){ Sys.sleep(10) rnorm(20) } takes.a.while() # Oh no I forgot to assign it to a variable lifesaver <- .Last.value lifesaver

Break a data frame into Groups of a factor split(df,factor) Creates a list of data frames by the Groups of the Chosen Factor
EXAMPLE warpbreaks (groups<-with(warpbreaks,split(warpbreaks,tension))) #method 1 with(groups,lapply(groups,mean)) with(groups,lapply(groups,nrow)) with(groups,lapply(groups,sd))


Creating a subset of data (useful for control vs. treatment groups) Using your original data set type: gg<-subset(a,g=="f", select=NULL)

Also see select cases below

subset(mtcars,select = mpg:vs) subset(airquality, Temp > 80, select = c(Ozone, Temp)) subset(airquality, Day == 1, select = -Temp) subset(airquality, select = Ozone:Wind)

Where gg is the name of the new data subset, subset is the function, a is the large data set name, g is the variable you wish to subset on, f is the level you want to isolate to make a new data set of, and select gives the columns you want. Original data set New subset

Remember to rename the variables for each data set in the same way you did with the larger data set (let's say we have a female and male subset for example). This can be done for any number of variables in the subset, enabling tests on the subgroups.

summary() is a great follow-up to generate some quick and useful information about each subset:

Type: summary(gg)

Select certain rows or columns by criteria Other ways to subset


mtcars2<-mtcars mtcars2[,c("mpg","disp","wt")] #select some columns meth. 1 subset(mtcars2,select=c(mpg,disp,wt)) #select some columns meth. 2 #Subset really fine tunes the selection: subset(mtcars2,select=c(mpg,disp,wt),subset=c(mpg>25&wt<=4)) subset(mtcars2,select=c(mpg,disp,wt),subset=c(mpg>25&wt<=4|disp>=400))
#========================================================
#select mpg equal to 21
with(mtcars,mtcars[mpg==21,])
#========================================================
#select mpg greater than or equal to 30
with(mtcars,mtcars[mpg>=30,])
#========================================================
#select mpg greater than or equal to 30 and disp less than 80
with(mtcars,mtcars[mpg>=30&disp<80,])
#========================================================
#select mpg greater than the median mpg and disp over 110
with(mtcars,mtcars[mpg>=median(mpg)&disp>110,])
#========================================================
mtcars.8cyl<-with(mtcars,mtcars[cyl==8,])
mtcars.8cyl<-mtcars.8cyl[-c(2)]
mtcars.8cyl
#this is the same as: subset(mtcars[-2],cyl==8)

Examples

slogical
LOGICAL OPERATORS & | ! COMPARISON OPERATORS == >= <= != Value Matching
%in%
with(mtcars,mtcars[cyl==8|cyl==6,])
with(mtcars,mtcars[cyl==8|gear>=5,])
with(mtcars,mtcars[cyl==8&mpg>=18,])
with(mtcars,mtcars[cyl==8&mpg>=18|wt>=3.5,])
with(mtcars,mtcars[cyl==8&mpg>=18&wt>=3.5,])
subset(mtcars, (cyl %in% c(6, 8)))
subset(mtcars, !(cyl %in% c(6, 8)))
subset(CO2, !(Plant %in% c("Qn1", "Mc3", "Mc1", "Mn2")))

Select columns from a data frame that are just numeric or just factors mtcars2[,sapply(mtcars2,is.numeric)] #same as NUM(df) in useful functions mtcars2[,sapply(mtcars2,is.factor)] #same as FAC(df) in useful functions NAs Please also see Missing Values for created functions and specific handling of NAs Create a Subset of non-NA for just one column/variable dataframe2<-subset(dataframe, !is.na(factor)) EXAMPLE:
mtcars2<-mtcars
mtcars2$carb<-with(mtcars2,replace(carb, carb==1,NA))
mtcars2;paste("n is = to",length(mtcars2$mpg))
mtcars3<-subset(mtcars2, !is.na(carb))
#Line above is the code (the rest is generating a dataset with NAs in it)
mtcars3;paste("n is = to",length(mtcars3$mpg))


Index a Single Columns Without Dropping the Variable Name drop name subset(dataframe, select=column.number) dataframe[, column.number, drop = FALSE] dataframe[column.number] or dataframe[column.name] #no comma
#EXAMPLES names(subset(mtcars, select=1)) names(mtcars[,1, drop = FALSE]) mtcars[1] or mtcars["mpg"]

Keep Column Names that Do Not Conform to R Standards data.frame('a b'=1, check.names=F) transform(data.frame('a b'=1, check.names=F), `c d`=`a b`, check.names=F) Split a Numeric Vector by a Categorical vector split(numeric,factor)
#================================================================
# CREATE THE DATA SET
#================================================================
colors<-c(sample(c("blue","red","green","orange"),20,replace=T))
hue<-abs(rnorm(20))
colorsDF<-data.frame(colors,hue)
#================================================================
# split(numeric.var,factor)
#================================================================
with(colorsDF,split(hue,colors))
#================================================================
# USING SPLIT FOR MEANS AND SD etc.
#================================================================
sapply(with(colorsDF,split(hue,colors)),mean)
sapply(with(colorsDF,split(hue,colors)),sd)
OUTPUT
> with(colorsDF,split(hue,colors))
$blue
[1] 0.0143338132 1.9922393211 0.7892910777 0.0004594093
$green
[1] 0.5897572 0.7480668 2.8692182 0.2506951
$orange
[1] 1.3469976 0.8757391 1.4951192 1.3781447
$red
[1] 0.05447693 0.10730018 2.20397056 0.05800449 1.84962318 0.15243645 0.29141207 0.05877585
===============================================
USING SPLIT FOR MEANS AND SD etc.
================================================
> sapply(with(colorsDF,split(hue,colors)),mean)
     blue     green    orange       red
0.6990809 1.1144343 1.2740002 0.5970000
> sapply(with(colorsDF,split(hue,colors)),sd)
     blue     green    orange       red
0.9376117 1.1881110 0.2730569 0.8909689


Sorting & Ordering Observations #1 see also arrange() & orderBy() ssort sorder NOTE: You can use sort() for vectors x[order(x$B),] #sort a dataframe by the order of the elements in B x[rev(order(x$B)),] #sort the dataframe in reverse order with(mtcars,mtcars[ order(-cyl, gear, carb) ,]) #sort ascending and descending (use of - )
#EXAMPLE mtcars[1:10] #first 1-10 of data set #order ascending #rev descending mtcars[rev(order(mtcars$cyl)),][1:10] #Sort by cyl (descending) mtcars[order(mtcars$cyl),][1:10] #Sort by cyl (ascending) mtcars[order(mtcars$cyl,mtcars$vs),][1:10] #Sort by cyl then vs (asc.) mtcars[order(mtcars$cyl,mtcars$vs,mtcars$gear),][1:10] #Sort by cyl,vs, then gear( asc.)

Sorting #2 arrange(df, ...) MORE EFFICIENT THAN ORDER

library(plyr)

(mtcars2<-data.frame(mtcars[1:16,],"grade"=c(rep(1:3,each=4,times=1),rep("k",4))))
levels(mtcars2$grade) <- c("4","3","2","1","k")
arrange(mtcars2, -as.numeric(factor(grade,levels=c("k","1","2","3"))), cyl, -disp)
arrange(mtcars, cyl, desc(disp))

Sorting #3 orderBy() I like this one the best orderBy(~formula, data=)


library(doBy) (mtcars2<-data.frame(mtcars[1:16,],"grade"=c(rep(1:3,each=4,times=1),rep("k",4)))) levels(mtcars2$grade) <- c("4","3","2","1","k") orderBy(~-grade + cyl, data=mtcars2) orderBy(~-grade + -disp + hp, data=mtcars2)

Duplicate Certain Rows of a Data Frame duplicate rows reprow(dataframe, column, value)
#THE FUNCTION reprow <- function(dataframe, column, value) { dataframe$ID <- 1:nrow(dataframe) DF <- data.frame(rbind(rbind(dataframe[which(dataframe[,column] %in% value), ], dataframe[which(dataframe[ ,column] %in% value), ]), rbind(dataframe[which(!dataframe[,column] %in% value), ]))) DF <- DF[order(DF$ID), ] DF$ID <- NULL rownames(DF) <- 1:nrow(DF) DF } #EXAMPLES reprow(mtcars, 'cyl', 4) #repeats any column with that value a second time reprow(mtcars, 'cyl', c(4, 6))


Replace (Go through data set and find values and replace them with a new value) replace(dataframe, list, values)

[recode; missing values ]srplace

#remember to assign this to some object i.e., x <- replace(dataframe,dataframe==-9,NA) #similar to the operation x[x==-9] <- NA This can also be done for just one variable in a data frame by saving the output as follows: dataframe$variable<- with(dataframe,replace(dataframe,factor==-9,NA)) EXAMPLE:
mtcars2<-mtcars
mtcars2$carb<-with(mtcars2,replace(carb, carb==1,NA)); mtcars2 #Way 1
mtcars2<-mtcars #RESET mtcars2
mtcars2$carb[mtcars2$carb==4]<-NA; mtcars2 #Way 2

Remove values in a vector subset could also be used vector [!is.element(x,c(values to remove))]
Example x <- sample(0:20,100,replace=T) table(x) x2<-x[!is.element(x, c(0,9,20))] #removes the values 0, 9, 20 table(x2)


Rename a column (rename a variable) method 1 names(dataframe)[c(column#)] <- "new.name"


EXAMPLE names(mtcars)[names(mtcars)=='hp'] <-'sweet.new.hp' names(mtcars) names(mtcars)[3] <- "new.name" names(mtcars) names(mtcars)[c(2,5)] <- c("new.name2","new.name3") names(mtcars)

Rename a column (rename a variable) method 2 rename(data.frame, vector of name conversions)

library(reshape)

rename variable rename column

Example: rename(mtcars, c(wt = "weight", cyl = "cylinders"))

Rename a column (rename a variable) method 3 rename.vars(data, from="", to="", info=TRUE)

library(gdata)

variable rename column

EXAMPLE dat<-mtcars rename.vars(dat, from="mpg", to="new", info=TRUE) rename.vars(dat, from=c("wt","mpg"), to=c("new1","new2"), info=TRUE)

Finding duplicate rows in a data frame (see matching) unique(data.frame)


Arguments x a vector or a data frame or an array or NULL. incomparables a vector of values that cannot be compared. FALSE is a special value, meaning that all values can be compared, and may be the only value accepted
#EXAMPLE
iris2<-data.frame(rbind(iris[1:15,],iris[1,],iris[3,]))
iris2<-with(iris2,iris2[order(Sepal.Length),])
rownames(iris2)<-c(1:17);iris2
mess<-c("\bNOTE:\n", "\bThe unique() function searches for duplicates;\n", "\bnotice observation 6 & 14 are eliminated\n")
unique(iris2,incomparables = FALSE)
cat(mess)

Locating variables and values which(x==" ")


example
mtcars.let<-data.frame("lets"=rep(letters[c(19:26)],4),mtcars) #making a dataframe
with(mtcars.let,which(lets=="v")) #[1] 4 12 20 28
with(mtcars.let,which(cyl=="6")) #[1] 1 2 4 6 10 11 30

Finding Duplicate Entries in a Column duplicated(x)


EXAMPLE (IRIS<-iris[1:30,]) IRIS$Petal.Length[!duplicated(IRIS$Petal.Length)] IRIS$Petal.Length[duplicated(IRIS$Petal.Length)] which(duplicated(IRIS$Petal.Length)) which(!duplicated(IRIS$Petal.Length))


Finding Truly Unique Items in Vector (3 methods)


#DATA SET: x <- c(378, 380, 380, 380, 380, 360, 187, 380) #METHOD 1 [fastest and numeric/unsorted] setdiff(unique(x), x[duplicated(x)]) #METHOD 2 [medium speed & numeric/sorted] y <- rle(sort(x)); y[[2]][y[[1]]==1] #METHOD 3 [slowest/character/sorted] b <- table(x); names(b[b==1]) test replications elapsed relative user.self sys.self user.child sys.child 3 METHOD_1 1000 0.08 1.000 0.06 0 NA NA 2 METHOD_2 1000 0.25 3.125 0.23 0 NA NA 1 METHOD_3 1000 0.61 7.625 0.48 0 NA NA

Same idea applied to character data


set.seed(100) x <- sample(c("Certin", "features", "of", "the", "setting", "affected"), 13, replace=T) x hapax1 <- function(x) {x <- na.omit(tolower(x)); setdiff(unique(x), x[duplicated(x)])} hapax1(x) hapax2 <- function(x)names(table(tolower(x))[table(tolower(x))==1]) hapax2(x) hapax3 <- function(x) {y <- rle(sort(tolower(x))); y[[2]][y[[1]]==1]} hapax3(x)


Using which() to Find Even and Odd Numbers even odd which(vector%%2 == 1) #find odd which(vector%%2 == 0) #find even

#EXAMPLES with(mtcars,which(hp%%2 == 1)) with(mtcars,which(hp%%2 == 0)) with(mtcars,sapply(mtcars,even<-function(x){which(x%%2 == 0)}))#apply to data frame

Using TRUE/FALSE To find odd and even of objects Select every other of a vector object[c(T, F)] #odds object[c(F, T)] #evens
#EXAMPLES mtcars[c(T,F)] #every odd column mtcars[c(F,T)] #every even column mtcars[c(T,F), ] #odd row mtcars[c(F,T), ] #even row

Add column to a data set or an existing column (add variable) transform(data.set.name,new.var=(Science.Comprehension*10))


EXAMPLE airquality<-airquality[1:10,] transform(airquality, Ozone = -Ozone) transform(airquality, new = -Ozone, Temp = (Temp-32)/1.8) #Notice that if a variable is unknown a new variable is created attach(airquality) transform(Ozone, logOzone = log(Ozone))

Delete a variable from a data set

See also subset()

ddrop variable drop column delete variable delete column

You could select all the columns you want and create a subset or:
POSITIVE SELECTION EXAMPLE: mtcars2<-mtcars mtcars2[,1:10] NEGATIVE SELECTION METHODS EXAMPLES: snegative indexing mtcars[, -which(names(mtcars) == "carb")] mtcars[, names(mtcars) != "carb"] mtcars[, !names(mtcars) %in% c("carb")] mtcars[, -match(c("carb"), names(mtcars))] mtcars2<-mtcars;mtcars2$hp <- NULL #-------------------------------library(gdata) #-------------------------------remove.vars(mtcars2, names="mpg", info=TRUE) remove.vars(mtcars2, names=c("wt","mpg"), info=TRUE)

Logic Testing & Coercion

test null test na test missing test fact test numeric is.character() is.vector() is.na() is.null()

is.numeric() is.factor() is.data.frame()


Logic Testing on a Whole Data Set str(dataset)


Examples: str(mtcars) str(CO2)

Method 2: TEST A WHOLE DATA SET WITH: library(gdata) is.what()


Example: library(gdata) sapply(mtcars,is.what)

Coerce the vectors using: as.numeric(y) as.factor(y) as.character(y) as.vector(y) as.data.frame(y)
Matching smatching match(x, y)
Finding Where Vectors (variables) are the Same and Different union(x, y); intersect(x, y); setdiff(x, y); setdiff(y, x); setequal(x, y); is.element(x, y) union combines and removes duplicates; intersect gives the shared elements of the 2 vectors; setdiff(x, y) gives what x has that y does not; setequal asks is set x the same as set y; is.element (or %in%) asks which elements of x appear in y

x <- c(sort(sample(1:20, 9)),NA) y <- c(sort(sample(3:23, 7)),NA) x;y length(x);length(y) union(x, y) ; length(union(x, y)) #combine and remove duplicates intersect(x, y)#duplicates of the 2 lists setdiff(x, y)#what x has that y does not setdiff(y, x) #what y has that x does not setequal(x, y) #is set x the same as set y a<-union(x,y) b<-c(setdiff(x,y), intersect(x,y), setdiff(y,x)) setequal(a,b);sort(a);sort(b) is.element(x, y)# what elements of set x are the same as set y is.element(y, x)# what elements of set y are the same as set x cat("\b%in% IS THE SAME AS is.element()\n") a%in%b;x%in%y;y%in%x

Matching Extenders %w/o% x without y (same as setdiff) %IN% x and y overlap (same as intersect) Examples

"%w/o%" <- function(x, y) x[!x %in% y] #x without y
(1:10) %w/o% c(3,7,12)
[1] 1 2 4 5 6 8 9 10
(1:10) %in% c(3,7,12)
[1] FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
"%IN%" <- function(x, y) x[x %in% y] #x and y overlap
(1:10) %IN% c(3,7,12)
[1] 3 7


Intersect multiple vectors See Overlap in User Defined Functions


Reduce(intersect, list(...)) a <- c(1,3,5,7,9) b <- c(3,6,8,9,10) c <- c(2,3,4,5,7,9) Reduce(intersect, list(a,b,c))

Find Rows Not Shared by Two Nested Data Sets x[! data.frame(t(x)) %in% data.frame(t(y)), ] Where x is the full data set and y is the nested data set
A <- mtcars B <- subset(mtcars, cyl==6) A[! data.frame(t(A)) %in% data.frame(t(B)), ]


Recoding Variables method 1 recode variables recode columns levels (variable)<-c("new names") This has one to one correspondence levels (variable)<- list(new1=c("A","C") etc) This can combine levels
EXAMPLE InsectSprays2<-InsectSprays levels(InsectSprays2$spray) levels(InsectSprays2$spray)<-list(new1=c("A","C"),YEPS=c("B","D","E"),LASTLY="F") levels(InsectSprays2$spray) InsectSprays2

Recoding Variables method 2

library(doBy)

recodeVar(x, src=c(), tgt=c(), default=NULL, keep.na=TRUE)


e30<-read.table("e30.csv", header=TRUE, sep=",", na.strings="NA")
library(doBy)
AIDE<-recodeVar(e30$aide,src=c(0,1,NA),tgt=c("YES","NO",NA))
CLASS.TP<-recodeVar(e30$cl.type,src=c(1,2,3,NA),tgt=c("AM","PM","FULL",NA))
CLASS.BH.SPR<-recodeVar(e30$cl.behav.spr,src=list(c(1:3),c(4,5),NA),tgt=c("POOR","GOOD",NA)) #This one is a complete recoding of a variable with cut points
DDD<-data.frame(AIDE,CLASS.TP,CLASS.BH.SPR)
DDD[620:630,]
NAhunter(DDD)

Cut Points (Chop a numeric variable into a factor) [NOT RECOMMENDED] cut(x, breaks, labels = NULL, include.lowest = F, right = T, dig.lab = 3, ordered_result = F)
EXAMPLE
aaa <- c(1.2,2.2,3,4.1,.7,2,pi,4,5.3434343344,6.245,pi/3)
cut(aaa, 3)
cut(aaa, 3, dig.lab=4, ordered = TRUE)
cut(aaa, 3, labels=c("low","medium","high"), ordered = T)
(BBB<-cut(aaa, 3, labels=c("low","medium","high"), ordered = F))
(DF<-data.frame("OBS"=LETTERS[1:11],"LEVEL"=BBB,"NUM.LVL"=aaa))
round(DF,digits=2) #Can't round with factors so...
DF2<-DF
DF$NUM.LVL<-round(DF$NUM.LVL,digits=2)
list("METHOD 1"=format(DF2,digits=3),"METHOD 2"=DF,"USING"="DF$NUM.LVL<-round(DF$NUM.LVL,digits=2)")
mtcars
(mpg.rating<-with(mtcars,cut(mpg,3)))
levels(mpg.rating)<-c("low","medium","high")
(mtcars2<-data.frame(mtcars,mpg.rating))
mtcars[with(mtcars2,which(mpg.rating=="high")),]
table(mpg.rating)
mtcars$HPcut<-cut(mtcars$hp, breaks=c(0,66,110,150,335), labels=c("low","medium","high","super"), include.lowest=TRUE, right=FALSE)


Relevel a factor method 1 Relevel factor Quick Reference

[reorder factor groups] (& recode numeric to factor [see cut points for more on this])

set.seed(12) z <-factor(sample(LETTERS[1:5], 10, T));z factor(z, levels=c("C", "D", "A", "B"))

#the releveling

> set.seed(12)
> z <-factor(sample(LETTERS[1:5], 10, T));z
[1] A E E B A A A D A A
Levels: A B D E
> factor(z, levels=c("C", "D", "A", "B"))
[1] A <NA> <NA> B A A A D A A
Levels: C D A B

Extra Reference
dataset$factorgroup <- factor(dataset$factorgroup, levels = c("c","a","b"),ordered=is.ordered(factor))
mtcars2<-mtcars mtcars2$carb[mtcars2$carb==4]<-NA mtcars2 mtcars2$carb[is.na(mtcars2$carb)]<-4 mtcars2 mtcars2$carb[mtcars2$carb<=4&mtcars2$carb>=3]<-"med" mtcars2$carb[mtcars2$carb<=2&mtcars2$carb>=1]<-"low" mtcars2$carb[mtcars2$carb<=8&mtcars2$carb>=6]<-"high" mtcars2 with(mtcars2,mtcars2[order(carb),]) mtcars2$carb <-as.factor(mtcars2$carb) levels(mtcars2$carb) mtcars2$carb <- factor(mtcars2$carb, levels = c("low","med","high"))

mtcars2$carb

Relevel a factor method 2 (order 1 group; kinda junky) relevel(x, ref, ...) # only places one group at the front (limited)
EXAMPLE warpbreaks$tension warpbreaks$tension2 <- relevel(warpbreaks$tension, ref="M") warpbreaks$tension2

Drop unused levels 1 droplevels(x)

drop factor levels

x is a factor vector/dataframe with factors Drop unused levels 2 x <- factor(x) drop factor levels
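A minimal sketch of why the unused level lingers and how each method removes it:

```r
f <- factor(c("a", "b", "c", "c"))
f2 <- f[f != "c"]       # the "c" values are gone from the data...
levels(f2)              # ..."c" is still a level: "a" "b" "c"
levels(droplevels(f2))  # method 1: "a" "b"
levels(factor(f2))      # method 2: "a" "b"
```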


Add an observation to a vector/column method 1 append(vector, new items to add, after=#)

EXAMPLE
old<-c(1:10)
old
[1] 1 2 3 4 5 6 7 8 9 10
new<-append(old,c(3,6,9),after=4)
new
[1] 1 2 3 4 3 6 9 5 6 7 8 9 10

Add an observation to a vector/column method 2 (preferred for its speed)


EXAMPLE mtcars2<-mtcars mtcars2;mtcars2$mpg[3] mtcars2$mpg[3]<-25 mtcars2 MPG<-mtcars2$mpg MPG[40]<-25 MPG #[R] fills in the gap w/ NAs

Combine characters/numbers in one column (variable) with that of another NOTE: you can do this in EXCEL using cell#& " " &cell# example: G1& " " &H1 (whatever is between the quotes will be the divider of the characters) paste(x,y,sep=" ") x, y, z are the variable characters/numbers to combine together. Whatever is between the " " will be the character separator.
library(doBy) x1<-recodeVar(sample(1:26,25,replace=T), src=1:26, tgt=letters, default=NULL, keep.na=TRUE) y1<-recodeVar(sample(1:26,25,replace=T), src=1:26, tgt=LETTERS, default=NULL, keep.na=TRUE) z1<-sample(1:26,25,replace=T) merged.characters<-paste(x1,z1,y1,sep="") data.frame(x1,z1,y1,merged.characters) paste(x1,z1,y1,sep="-")#variation

Paste unknown number of columns apply(x, 1, paste, collapse = ".") apply(x, 1, function(x){if(any(is.na(x))){NA}else{paste(x, collapse = ".")}}) #if any NA returns NA

#EXAMPLES CO2[1,1] <- NA x <- CO2[, 1:3] y <- CO2[, 1:4] apply(x, 1, paste, collapse = ".") apply(x, 1, function(x){if(any(is.na(x))){NA}else{paste(x, collapse = ".")}}) #do.call METHOD y <- as.list(CO2[1:3]) # make it a list y$sep = "." # set our separator do.call("paste", y)


Matrices & Data frames The difference between a matrix and a data frame is that the matrix must have all the same type of data (e.g., numeric, character, etc.). A data frame may have mixed columns of data. Turn a Vector Into a Matrix EXAMPLE
b<-1:20 b dim(b)<-c(4,5) b

Creating a Matrix
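A matrix can also be built directly with matrix(), instead of reshaping a vector with dim() as above (a minimal sketch):

```r
m <- matrix(1:20, nrow = 4, ncol = 5)  # filled column-wise by default
m
matrix(1:20, nrow = 4, byrow = TRUE)   # fill row-wise instead
dim(m)                                 # 4 5
```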

Change upper and lower triangle of matrix lower.tri(x, diag = FALSE) upper.tri(x, diag = FALSE) ARGUMENTS: x a matrix. diag logical. Should the diagonal be included?

EXAMPLE #CREATE A CORRELATION MATRIX WITH THE LOWER TRIANGLE r^2 VALUES CORmat<-cor(mtcars) lower.tri(CORmat) CORmat[lower.tri(CORmat)]<-CORmat[which(lower.tri(CORmat))]^2

Splicing or gluing together Rows or Columns rbind() cbind() NOTE: rbind or cbind are slow functions for larger data sets. It is usually better to create a blank matrix first and then use indexing to put the information into the blank matrix.
Example (aRow <- matrix(NA, ncol=18, nrow=49)) #create a matrix of NAs and then fill aRow[1:44,1:8] <- as.vector(as.matrix(mtcars)) aRow[29:49,11:18] <- as.vector(as.numeric(as.matrix(CO2)))[253:420] aRow #notice it's all numeric aRow[49,1]<-"a" aRow # changes matrix to character


Matrix Algebra Transpose X' or X^T t(X) Diagonal diag(X) Matrix Multiplication XY X %*% Y Matrix Inverse X^-1 solve(X) Outer Product XY' X %o% Y Column Means Returns a vector containing the column means of X colMeans(X) Cross Products X'X crossprod(X) Cross Products X'Y crossprod(X,Y)
#Example of Regression Parameters With Matrix Algebra
#FORMULA: b = (X'X)^-1(X'y)
#DATA
midterm <- c(5,7,7,7,9)
final <- c(4,5,6,8,10)
(SUM <- summary(lm(final~midterm)))
#==============================================
#ASSIGN DATA TO LETTERS TO FIT MATRIX NOTATION
x <- midterm
y <- final
#==============================================
#CONVERT VECTOR x TO MATRIX X; THE COLUMN OF ONES
#BEING ADDED IS FOR THE INTERCEPT PARAMETER
X <- as.matrix(c(rep(1,length(x)),x))
dim(X)<-c(5,2)
X
#NOW THE HEAVY LIFTING MADE EASY:
(b <- solve(crossprod(X))%*%crossprod(X,y))


Text and Character Strings Combine multiple items into a single response: cat() and paste() Example
First <- c("Greg","Sue","Sally"); Last <- c("Smith","Collins","Peters")
Ages <- c(11,12,11); ClassRoster <- data.frame(Last, First, Ages)
ClassRoster
Students.FL <- paste(First, Last, sep=" ")
Students.LF <- paste(Last, First, sep=" ")
#============================================================================================================
paste(Students.FL, "is", Ages, "years", "old.")
#............................................................................................................
#OUTPUT-->[1] "Greg Smith is 11 years old." "Sue Collins is 12 years old." "Sally Peters is 11 years old."
#============================================================================================================
cat(paste(Students.FL, "is", Ages, "years", "old", sep=" ", collapse=", "))
#............................................................................................................
#OUTPUT-->Greg Smith is 11 years old, Sue Collins is 12 years old, Sally Peters is 11 years old

Input
cat("The date and time is", date(), "!", "\n")
Output
The date and time is Thu May 05 10:17:47 2011 !

Pretty-print numbers with commas, and convert comma-formatted strings back to numeric
prettyNum(x, big.mark=",", scientific=FALSE)
as.numeric(gsub(",", "", x))
noquote(prettyNum(12345.678,big.mark=",",scientific=F)) x<-noquote(prettyNum(c(12345.678, 123154543, 32434343),big.mark=",",scientific=F)) as.numeric(gsub(",","", x))

Turn a character string into a formula as.formula()


test<-c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb") lm(as.formula(paste(test[1],"~", paste(test[-1],collapse="+"), sep="")), data=mtcars)


Special Escaped Characters Using cat() with quotes to manipulate text output
\n   newline
\r   carriage return
\t   tab
\b   backspace
\a   alert (bell)
\f   form feed
\v   vertical tab
\\   backslash \
\'   ASCII apostrophe ' (see also sQuote(phrase))
\"   ASCII quotation mark " (see also dQuote(phrase))

EXAMPLES
cat("\a","Hello","\n","Hel","\blo","\'\tHELLO!\'","\"WElp!\"","\n",
    "DELETE ME","","\r\"I DID DELETE YOU!\"","\n","BYE","\\Yeah Backslash",
    "\n")
cat(LETTERS,"\n","\r")

Eliminate the "\" from strings

(test <- c("\\hi\\", "\n", "\t", "\\1", "\1", "\01", "\001"))
eval(parse(text=gsub("\\", "", deparse(test), fixed=TRUE)))
#INPUT
#[1] "\\hi\\" "\n"    "\t"    "\\1"   "\001"  "\001"  "\001"
#OUTPUT
#[1] "hi"  "n"   "t"   "1"   "001" "001" "001"


Built-in Character String Constants

LETTERS
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M"
[14] "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
letters
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m"
[14] "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
month.abb
 [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
month.name
 [1] "January"   "February"  "March"     "April"     "May"       "June"
 [7] "July"      "August"    "September" "October"   "November"  "December"
state.name
state.abb
Remove quotes for printing
noquote(letters)
>[1] a b c d e f g h i j k l m n o p q r s t u v w x y z
cat(letters)
>a b c d e f g h i j k l m n o p q r s t u v w x y z
Number of letters per word in a character string
nchar(character string) yields the number of characters per word
example: pets <- c("chester","callie"); nchar(pets)


Replacing Characters in a String gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

Arguments
pattern      character string containing a regular expression (or character string for fixed = TRUE) to be matched in the given character vector. Coerced by as.character to a character string if possible. If a character vector of length 2 or more is supplied, the first element is used with a warning. Missing values are allowed except for regexpr and gregexpr.
replacement  a replacement for the matched pattern in sub and gsub. Coerced to character if possible. For fixed = FALSE this can include backreferences "\1" to "\9" to parenthesized subexpressions of pattern. For perl = TRUE only, it can also contain "\U" or "\L" to convert the rest of the replacement to upper or lower case and "\E" to end case conversion. If a character vector of length 2 or more is supplied, the first element is used with a warning. If NA, all elements in the result corresponding to matches will be set to NA.
x, text      a character vector where matches are sought, or an object which can be coerced by as.character to a character vector.
ignore.case  if FALSE, the pattern matching is case sensitive; if TRUE, case is ignored during matching.
perl         logical. Should Perl-compatible regexps be used? Has priority over extended.
value        if FALSE, a vector containing the (integer) indices of the matches determined by grep is returned; if TRUE, a vector containing the matching elements themselves is returned.
fixed        logical. If TRUE, pattern is a string to be matched as is. Overrides all conflicting arguments.
useBytes     logical. If TRUE the matching is done byte-by-byte rather than character-by-character. See Details.
invert       logical. If TRUE, return indices or values for elements that do not match.

Each of these functions (apart from regexec, which currently does not support Perl-style regular expressions) operates in one of three modes:
1. fixed = TRUE: use exact matching.
2. perl = TRUE: use Perl-style regular expressions.
3. fixed = FALSE, perl = FALSE: use POSIX 1003.2 extended regular expressions.


EXAMPLE 1 input
a <- c("foo_5h", "bar_7")
gsub(".*_", "", a)
b <- c("xtfo_oin5hl", "6b_arin7", "xin7")
gsub("in.", "", b)
gsub("t.*l", "HERE", b)
gsub("^([a-zA-Z]in)", "INSERT", b)
d <- c("xtfo_oin5h;lx", "6b_arin;7", "xin;7")
gsub("t.+?l", "HERE", d)
gsub("[a-zA-Z].+?l", "HERE", d)
gsub("[a-zA-Z].+?;", "HERE", d)
gsub("_.+?;", "HERE", d)
e <- c("Dog foo_5h dog bar_7 doGs God")
gsub("\\bdog\\b", "HERE", e)
gsub("\\bdog.\\b", "HERE", e)
gsub("[^a-zA-Z0-9]", "", e)
gsub("\\b[dD][oO][Gg].\\b", " ", e)
gsub("\\b[dD][oO][Gg]\\b", " ", e)

Match & replace from here to blank

EXAMPLE 1 outcome > a <- c("foo_5h", "bar_7") > gsub(".*_", "", a) [1] "5h" "7" > > b <- c("xtfo_oin5hl", "6b_arin7", "xin7") > gsub("in.", "", b) [1] "xtfo_ohl" "6b_ar" "x" > gsub("t.*l", "HERE", b) [1] "xHERE" "6b_arin7" "xin7" > gsub("^([a-zA-Z]in)", "INSERT", b) [1] "xtfo_oin5hl" "6b_arin7" "INSERT7" > > d <- c("xtfo_oin5h;lx", "6b_arin;7", "xin;7") > gsub("t.+?l", "HERE", d) [1] "xHEREx" "6b_arin;7" "xin;7" > gsub("[a-zA-Z].+?l", "HERE", d) [1] "HEREx" "6b_arin;7" "xin;7" > gsub("[a-zA-Z].+?;", "HERE", d) [1] "HERElx" "6HERE7" "HERE7" > gsub("_.+?;", "HERE", d) [1] "xtfoHERElx" "6bHERE7" "xin;7" > > e <- c("Dog foo_5h dog bar_7 doGs God") > gsub("\\bdog\\b", "HERE", e) [1] "Dog foo_5h HERE bar_7 doGs God" > gsub("\\bdog.\\b", "HERE", e) [1] "Dog foo_5h HEREbar_7 doGs God" > gsub("[^a-zA-Z0-9]", "", e) [1] "Dogfoo5hdogbar7doGsGod" > gsub("\\b[dD][oO][Gg].\\b", " ", e) [1] " foo_5h bar_7 God" > gsub("\\b[dD][oO][Gg]\\b", " ", e) [1] " foo_5h bar_7 doGs God"

EXAMPLE 2
text.ex <- c("hat","coat","gloves","shirt","pants")
gsub("h","H",text.ex)
gsub("^.","A",text.ex)
gsub("(\\w)(\\w*)","\\U\\1\\L\\2",text.ex,perl=T)
gsub("(\\w*)","\\U\\1",text.ex,perl=T)
gsub("(\\w)(\\w*)(\\w)", "\\U\\1\\E\\2\\U\\3", text.ex, perl=T)
Output
> gsub("h","H",text.ex)
[1] "Hat"    "coat"   "gloves" "sHirt"  "pants"
> gsub("^.","A",text.ex)
[1] "Aat"    "Aoat"   "Aloves" "Ahirt"  "Aants"
> gsub("(\\w)(\\w*)","\\U\\1\\L\\2",text.ex,perl=T)
[1] "Hat"    "Coat"   "Gloves" "Shirt"  "Pants"
> gsub("(\\w*)","\\U\\1",text.ex,perl=T)
[1] "HAT"    "COAT"   "GLOVES" "SHIRT"  "PANTS"
> gsub("(\\w)(\\w*)(\\w)", "\\U\\1\\E\\2\\U\\3", text.ex, perl=TRUE)
[1] "HaT"    "CoaT"   "GloveS" "ShirT"  "PantS"
x <- "ATextIWantToDisplayWithSpaces"
gsub('([[:upper:]])', ' \\1', x)  #Split on capital letters

> gsub('([[:upper:]])', ' \\1', x)
[1] " A Text I Want To Display With Spaces"
x <- "I like it...What the... Oh I see it."
#replace the ellipses
gsub(pattern = "\\.\\.\\.", replacement = ".", x)
#or
gsub(pattern = "\\.+", replacement = ".", x)  #better and more flexible
> gsub(pattern = "\\.\\.\\.", replacement = ".", x)
[1] "I like it.What the. Oh I see it."


Replacing Certain Occurrences


string <- c('sta_+1+0_field2ndtry_0000$01.cfg',
            'sta_+B+0_field2ndtry_0000$01.cfg',
            'sta_+1+0_field2ndtry_0000$01.cfg',
            'sta_+9+0_field2ndtry_0000$01.cfg')
sapply(1:length(string), function(i) gsub("\\+(.*)\\+.", paste("\\+\\1\\+", i, sep=""), string[i]))
> string
[1] "sta_+1+0_field2ndtry_0000$01.cfg" "sta_+B+0_field2ndtry_0000$01.cfg"
[3] "sta_+1+0_field2ndtry_0000$01.cfg" "sta_+9+0_field2ndtry_0000$01.cfg"
> sapply(1:length(string), function(i) gsub("\\+(.*)\\+.", paste("\\+\\1\\+", i, sep=""), string[i]))
[1] "sta_+1+1_field2ndtry_0000$01.cfg" "sta_+B+2_field2ndtry_0000$01.cfg"
[3] "sta_+1+3_field2ndtry_0000$01.cfg" "sta_+9+4_field2ndtry_0000$01.cfg"

Find and remove space and/or numeric occurrences


#EXAMPLE 1
#=========
data <- c("Flagstaff 2", "Los Angeles 23", "Cleveland 29", "Cleveland 29", "Seattle 22")
gsub("\\s*\\d*$", "", data)
[1] "Flagstaff"   "Los Angeles" "Cleveland"   "Cleveland"   "Seattle"

#EXAMPLE 2
#=========
x <- "the dog ate his \n food"
gsub("[^o h \n]", "", x)
gsub("[^o h \n]|\\s+", "", x)
> gsub("[^o h \n]", "", x)
[1] "h o h \n oo"
> gsub("[^o h \n]|\\s+", "", x)
[1] "hohoo"

Find Consecutive Occurrences


mystring <- c(1, 2, 3, "toot", "tooooot", "good", "apple", "banana", "frrr")
mystring[!grepl("(.)\\1{2,}", mystring)]
mystring[!grepl("(.)\\1{1,}", mystring)]
gsub("(.)\\1{2,}", "HELLO", mystring)
## > mystring[!grepl("(.)\\1{2,}", mystring)]
## [1] "1"      "2"      "3"      "toot"   "good"   "apple"  "banana"
## > mystring[!grepl("(.)\\1{1,}", mystring)]
## [1] "1"      "2"      "3"      "banana"
## > gsub("(.)\\1{2,}", "HELLO", mystring)
## [1] "1"       "2"       "3"       "toot"    "tHELLOt" "good"    "apple"   "banana"  "fHELLO"


Find Location of Chunks within a String(s)

Find a pattern in a string or a vector of strings gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
gregexpr("es", "Testes") c(gregexpr("es", "Test")[[1]]) c(gregexpr("es", "Testes")[[1]]) c(gregexpr("es", "Testes establishes esteem")[[1]]) gregexpr("es", c("Testes", "dog", 6, "esteem")) #vector of strings

Find a pattern in a vector of strings


Gives Location

grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE)
Gives Logical TRUE/FALSE

grepl(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)


grep("es", c("Testes", "dog", 6, "esteem")) #[1] 1 4 grepl("es", c("Testes", "dog", 6, "esteem")) #[1] TRUE FALSE FALSE TRUE
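The value and invert arguments shown in the grep() signature above, sketched on the same vector:

```r
x <- c("Testes", "dog", "6", "esteem")
grep("es", x, value = TRUE)                 #return the matching elements, not indices
#[1] "Testes" "esteem"
grep("es", x, value = TRUE, invert = TRUE)  #elements that do NOT match
#[1] "dog" "6"
```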


String Splitting Split on first space


string<- factor(c("California CA", "New York NY", "Georgia GA"))

#Cheesy Method
string <- gsub(" +", " ", string)
sapply(string, function(x) substring(x, 1, nchar(x)-3))
#or
unlist(lapply(string, function(x) substring(x, 1, nchar(x)-3)))
#sub Method (BETTER!)
sub("[[:space:]]*..$", "", string)
#OUTPUT
[1] "California" "New York"   "Georgia"

#Also could have been solved:
(x <- rownames(mtcars))
rexp <- "^(\\w+)\\s?(.*)$"
sub(rexp, "\\1", x)
sub(rexp, "\\2", x)
data.frame(COM=x, MANUF=sub(rexp,"\\1",x), MAKE=sub(rexp,"\\2",x))
                COM    MANUF      MAKE
27    Porsche 914-2  Porsche     914-2
28     Lotus Europa    Lotus    Europa
29   Ford Pantera L     Ford Pantera L
30     Ferrari Dino  Ferrari      Dino
31    Maserati Bora Maserati      Bora
32       Volvo 142E    Volvo      142E

#METHOD 2
mat <- do.call("rbind", strsplit(sub(" ", ";", x), ";"))
colnames(mat) <- c("MANUF", "MAKE")

#METHOD 3
library(reshape2)
y <- reshape2::colsplit(x, " ", c("MANUF","MAKE"))
tail(y)

#METHOD 4
library(stringr)
split_x <- str_split(x, " ", 2)
y <- data.frame(
  MANUF = sapply(split_x, head, n = 1),
  MAKE  = sapply(split_x, tail, n = 1)
)
tail(y)

str <- c("George W. Bush", "Lyndon B. Johnson")
gsub("([A-Z])[.]?", "\\1", str)
sub(" .*", "", str)
sub("\\s\\w+$", "", str)
sub(".*\\s(\\w+$)", "\\1", str)

str <- c("&George W. Bush", "Lyndon B. Johnson?") gsub("[^[:alnum:][:space:].]", "", str)


> str <- c("&George W. Bush", "Lyndon B. Johnson?") > gsub("[^[:alnum:][:space:].]", "", str) [1] "George W. Bush" "Lyndon B. Johnson" > str <- c("George W. Bush", "Lyndon B. Johnson") > gsub("([A-Z])[.]?", "\\1", str) [1] "George W Bush" "Lyndon B Johnson" > sub(" .*", "", str) [1] "George" "Lyndon" > sub("\\s\\w+$", "", str) [1] "George W." "Lyndon B." > sub(".*\\s(\\w+$)", "\\1", str) [1] "Bush" "Johnson" > > str <- c("&George W. Bush", "Lyndon B. Johnson?") > gsub("[^[:alnum:][:space:].]", "", str) [1] "George W. Bush" "Lyndon B. Johnson"

Split on first underscore character


library(reshape2)
my_var_1 <- factor(c("x00_aaa_123","x00_bbb_123","x00_ccc_123",
                     "x01_aaa_123","x01_bbb_123","x01_ccc_123",
                     "x02_aaa_123","x02_bbb_123","x02_ccc_123"))
colsplit(my_var_1, "_", c("x","whatever"))
    x whatever
1 x00  aaa_123
2 x00  bbb_123
3 x00  ccc_123
4 x01  aaa_123
5 x01  bbb_123
6 x01  ccc_123
7 x02  aaa_123
8 x02  bbb_123
9 x02  ccc_123

Split on first comma


y <- c("Here's comma 1, and 2, see?", "Here's 2nd sting, like it, not a lot.")
#Method 1
XX <- "SoMeThInGrIdIcUlOuS"
LIST <- strsplit(sub(",\\s*", XX, y), XX)
LIST2 <- lapply(LIST, function(x) data.frame('x'=c(x[1]), 'z'=c(x[2])))
do.call('rbind', LIST2)
#Method 2
y2 <- strsplit(y, ",")
LIST <- sapply(seq_along(y2), function(i) data.frame(x=y2[[i]][1],
    z=paste(y2[[i]][-1], collapse=" ")), simplify=F)
do.call('rbind', LIST)
#Method 3
library(reshape2)
colsplit(y, ",", c("x","z"))


Piece Grabbing Grab part


x <- c( "ELOVL7", "ELP2", "EMC1 (includes EG:23065)", "EPT1 (includes EG:28042)", "ZEB1 (includes EG:29009)" ) gsub("(.*)\\s+\\(.*\\)", "\\1", x)

Test and grab certain occurrences (e.g. strings beginning with abc and ending in certain numerals)
#example 1
#==========
s <- c('abc1', 'abc2', 'abc3', 'abc11', 'abc12', 'abcde1', 'abcde2',
       'abcde3', 'abcde11', 'abcde12', 'nonsense')
s[grepl("abc.*(3|11|12)", s)]
s[grepl("^abc", s) & grepl("(3|11|12)$", s)]  #^ anchors at the start, $ at the end (2nd form is more interpretable)
> s[grepl("abc.*(3|11|12)", s)]
[1] "abc3"    "abc11"   "abc12"   "abcde3"  "abcde11" "abcde12"
> s[grepl("^abc", s) & grepl("(3|11|12)$", s)]
[1] "abc3"    "abc11"   "abc12"   "abcde3"  "abcde11" "abcde12"

#example 2
#==========
x <- c("fcer cgr tr cg g.", "gce tgv te ger refxre,c3rfc rf3rcf3rfr?")
x[grepl("^[[:alpha:]]", x) & grepl("(\\?)$", x)]
x[grepl("^[[:alpha:]]", x) & grepl("(\\?|\\.)$", x)]
> x[grepl("^[[:alpha:]]", x) & grepl("(\\?)$", x)]
[1] "gce tgv te ger refxre,c3rfc rf3rcf3rfr?"
> x[grepl("^[[:alpha:]]", x) & grepl("(\\?|\\.)$", x)]
[1] "fcer cgr tr cg g."
[2] "gce tgv te ger refxre,c3rfc rf3rcf3rfr?"

Grab everything except the last word


df1 <- structure(list(id = c(1, 2, 3), city = structure(c(2L, 3L, 1L ), .Label = c("Hillside Village", "Middletown Township", "Sunny Valley Borough" ), class = "factor")), .Names = c("id", "city"), row.names = c(NA, -3L), class = "data.frame") gsub("\\s*\\w*$", "", df1$city) > gsub("\\s*\\w*$", "", df1$city) [1] "Middletown" "Sunny Valley" "Hillside"


Split apart by chunks


test <- "abc123def" x <- gsub("([0-9]+)","~\\1~", test) strsplit(x, "~") #or in one step strsplit(gsub("([0-9]+)","~\\1~", test), "~") [[1]] [1] "abc" "123" "def"


Punctuation Delete all punctuation except


#EXAMPLE
x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"
#METHODS FOR SUBBING OUT ALL PUNCTUATION EXCEPT APOSTROPHES
gsub("[^[:alnum:][:space:]'\"]", "", x)            #METHOD 1
gsub(".*?($|'|[^[:punct:]]).*?", "\\1", x)         #METHOD 2
gsub("(.*?)($|'|[^[:punct:]]+?)(.*?)", "\\2", x)   #METHOD 3
#EXTENDING METHOD 1 TO SUB OUT EVERYTHING EXCEPT APOSTROPHES AND SEMICOLONS
gsub("[^[:alnum:][:space:]';\"]", "", x, perl=T)


Capitalization
#Capitalize the first letter of each word
capitalize <- function(x) {
  simpleCap <- function(x) {
    s <- strsplit(x, " ")[[1]]
    paste(toupper(substring(s, 1, 1)), substring(s, 2), sep = "", collapse = " ")
  }
  unlist(lapply(x, simpleCap))
}
x <- "i'll"
y <- "you"
z <- c("I'll", "go")
capitalize(x)
capitalize(y)
capitalize(z)

Capital Letters: toupper(string)   Lower Case Letters: tolower(string)

EXAMPLE string<-toupper(paste("i do not know"," where the dog is.",sep="")) cat(string,"\n",sep="") string tolower(string)


String Matching Search a Vector for a Match see my Search() function

Exact Matches grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE) Arguments:

Approximate Matches agrep(pattern, x, ignore.case = FALSE, value = FALSE, max.distance = 0.1, useBytes = FALSE) Arguments:

animals <- c("mose", "dog", "cat", "gooberciluousrex")
animals[agrep("mouse", animals, max.distance = 0.01)]
animals
animals[agrep("chese", animals)] <- "mouse"
animals[agrep("goobercilsdeef", animals, max.distance = 0.01)] <- "duck"
animals
animals[agrep("goobercilsdeef", animals, max.distance = 0.29)] <- "duck"
animals <- "cheese"

str1 <- "This is a animals string, that I've written to ask about a question, or at least tried to."


My Search() function
Search <- function(term, dataframe, column.name, variation = .02, ...){
  te <- substitute(term)  #use " " for multi-word terms
  te <- as.character(te)
  cn <- substitute(column.name)
  cn <- as.character(cn)
  HUNT <- agrep(te, dataframe[, cn], ignore.case = TRUE, max.distance = variation, ...)
  dataframe[c(HUNT), ]
}
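A usage sketch of the Search() function above; the roster data below is made up for illustration, and the bare column name works because the function captures it with substitute():

```r
#Search() as defined above, repeated here so this chunk runs on its own
Search <- function(term, dataframe, column.name, variation = .02, ...){
  te <- as.character(substitute(term))
  cn <- as.character(substitute(column.name))
  HUNT <- agrep(te, dataframe[, cn], ignore.case = TRUE, max.distance = variation, ...)
  dataframe[HUNT, ]
}
roster <- data.frame(name = c("Greg Smith", "Sue Collins", "Sally Peters"),
                     age  = c(11, 12, 11), stringsAsFactors = FALSE)
Search("Sue Collins", roster, name)            #multi-word terms need quotes
Search(Colins, roster, name, variation = .25)  #fuzzy: a misspelling still matches
```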

Search for a term

str <- "BBSSHHSRBSBBS" unlist(gregexpr("BS", str)) str2 <- "I can't stand know it all egg head scientists." unlist(gregexpr("i can't", tolower(str2))) term <- "egg head" loc <- unlist(gregexpr(term, tolower(str2))) substring(str2, loc, nchar(term)-1+loc)

Search for a term and Count Occurrences


str2 <- "ionisation should only be matched at the end of the word"
matched_commas <- gregexpr(",", str1, fixed = TRUE)
length(matched_commas[[1]])
matched_ion <- gregexpr("ion", str1, fixed = TRUE)
length(matched_ion[[1]])
length(gregexpr("ion\\b", str2, perl = TRUE)[[1]])

Search for Strings that contain a phrase


a <- c('This is a healthcare facility', 'this is a hospital', 'this is a hospital district', 'this is a district health service') a[grepl("hospital", a) & !grepl("district", a)] a[!grepl("district", a)]

Levenshtein distance between strings


pres <- c(" Obama, B.","Bush, G.W.","Obama, B.H.","Clinton, W.J.") lapply(pres, agrep, pres, value = F) lapply(pres, agrep, pres, value = T)
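agrep() above matches within a tolerance; base R's adist() (available since R 2.14) returns the actual Levenshtein edit-distance matrix, sketched on the same vector:

```r
pres <- c(" Obama, B.", "Bush, G.W.", "Obama, B.H.", "Clinton, W.J.")
d <- adist(pres)                   #pairwise Levenshtein edit distances
rownames(d) <- colnames(d) <- pres
d
which(d == min(d[d > 0]), arr.ind = TRUE)  #closest pair: the two Obama spellings
```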


Test for occurrence in Columns


myfile <- read.table(text = '
"G1"        "G2"
SEP11       ABCC1
205772_s_at FMO2
214223_at   ADAM19
ANK2        215742_at
COPS4       BIK
214808_at   DCP1A
ACE         ALG3
BAD         215369_at
EMP3        215385_at
CARD8       217579_x_at
', header = TRUE, stringsAsFactors = FALSE)

lapply(myfile, function(column) grep("_at$", column, invert = TRUE, value = TRUE))

lapply(myfile, function(column) grep("_at$", column, value = TRUE))

lapply(myfile, function(column) grep("_at$", column, invert = TRUE))


Insert characters between characters of a character string See also "Insert a Vector of Character Strings Into Another Character String"
x <- "output"
#Method 1
y <- paste(unlist(strsplit(x, NULL)), collapse="\n")
y        #[1] "o\nu\nt\np\nu\nt"
cat(y)   #prints one letter per line
#Method 2
z <- gsub('(?<=.)(?=.)', '\n', x, perl=TRUE)
cat(z)

Insert a Vector of Character Strings Into Another Character String


a <- c("string", "factor")
sprintf("This is where a %s goes.", a)
[1] "This is where a string goes." "This is where a factor goes."

Insert Vector(s) of Character Strings Into Another Character String


#paste method n <- 10; a <- 1:n paste0("p", a, "=", a) #sprintf method n <- 10; a <- 1:n sprintf("p%d=%d", a, a)

Insert Trailing or Leading Spaces Easily


x <- c("I like", "good", "better than you") sprintf("%8s", x) #Add leading space sprintf("%-8s", x) #Add trailing space > sprintf("%8s", x) #Add leading space [1] " I like" " good" "better than you" > sprintf("%-8s", x) #Add trailing space [1] "I like " "good " "better than you"

Delete Trailing or Leading Spaces Easily
gsub("^\\s+", "", x)   #leading spaces
gsub("\\s+$", "", x)   #trailing spaces
Trim <- function (x) gsub("^\\s+|\\s+$", "", x)
Insert Leading Zeros
sprintf("%02d", c(1,2,3,45))
> sprintf("%02d",c(1,2,3,45)) [1] "01" "02" "03" "45" > sprintf("%03d",c(1,2,3,45)) [1] "001" "002" "003" "045" > sprintf("%010d",c(1,2,3,45)) [1] "0000000001" "0000000002" "0000000003" "0000000045"
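A quick sketch of the Trim() helper above; note that base R (>= 3.2.0) also ships trimws(), which does the same job:

```r
Trim <- function(x) gsub("^\\s+|\\s+$", "", x)  #\s covers spaces and tabs
x <- c("  lead", "trail   ", "\t both \t")
Trim(x)
#[1] "lead"  "trail" "both"
trimws(x)  #same result in base R >= 3.2.0
```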


Reverse character strings reverse(string) — a user-defined function (definition below)


reverse("this is a string") strings1 <- c(123,4212,234567) reverse(strings1)

reverse <- function(string) {
  strReverse <- function(x) sapply(lapply(strsplit(x, NULL), rev), paste, collapse = "")
  if (is.numeric(string)) {
    strReverse(as.character(string))
  } else {
    strReverse(string)
  }
}

strings2 <- c("retsnomerom","was","retar") reverse(strings2)


Select portions of a character string substr(text,start point,end point)


EXAMPLES
substr("abcdefghi", 2, 5)
substr("abcdefghi", 1, 8)
substring("Callie loves to chew bones!", 8, 20)
substring("Callie loves to chew bones!", 28:1)
substring("Callie loves to chew bones!", 1:28)
data.frame(cbind(substring("Callie loves to chew bones!", 28:1),
                 substring("Callie loves to chew bones!", 1:28)))
substr(rep("abcdef", 4), 1:4, 4:5)
x <- c("asfef", "qwerty", "yuiop[", "b", "stuff.blah.yech")
substr(x, 2, 5)
substring(x, 2, 4:6)
substring(x, 2) <- c("..", "+++")
x
#USE TO PULL APART BEDS CODE
beds.numbs <- as.character(c(3452171,3452172,3462173,3452274,3452275,3462276,3462277,
                             3452178,3452189,3452080,3452081,3452082,3462083))
(Region <- substr(beds.numbs, 1, 3))   #use this to recode regions
(District <- substr(beds.numbs, 1, 6))
DATS <- data.frame(beds.numbs, Region, District)
with(DATS, table(District))
with(DATS, ftable(DATS))
subset(DATS, Region == "346")
subset(DATS, District == "345208")


Create a diminishing list from a vector of names

PREDS<-c("gender", "g1freelunch", "g3tmathss", "g3treadss", "yearssmall", "crap") #===============================================

method 1

getDiminishingList <- function(data){
  ans <- list()
  for(i in 1:length(data)){
    ans[[i]] <- data[1:(length(data) - i + 1)]
  }
  ans
}
# Use function
getDiminishingList(PREDS)
getDiminishingList(1:10)
#===============================================

method 2

getDiminishingList <- function(data){
  n <- length(data)
  tmpfunc <- function(i){
    data[1:(length(data) - i + 1)]
  }
  return(apply(matrix(1:n), 1, tmpfunc))
}
# Use function
getDiminishingList(PREDS)
getDiminishingList(1:10)
Output
[[1]]
[1] "gender"      "g1freelunch" "g3tmathss"   "g3treadss"   "yearssmall"
[6] "crap"

[[2]]
[1] "gender"      "g1freelunch" "g3tmathss"   "g3treadss"   "yearssmall"

[[3]]
[1] "gender"      "g1freelunch" "g3tmathss"   "g3treadss"

[[4]]
[1] "gender"      "g1freelunch" "g3tmathss"

[[5]]
[1] "gender"      "g1freelunch"

[[6]]
[1] "gender"


Convert a Character String or Factor to Numeric


Method 1
y <- c("OLDa", "ALL", "OLDc", "OLDa", "OLDb", "NEW", "OLDb", "OLDa", "ALL")
el <- c("OLDa", "OLDb", "OLDc", "NEW", "ALL")
match(y, el)
Method 2
f <- factor(y, levels = c("OLDa", "OLDb", "OLDc", "NEW", "ALL"))
as.integer(f)

Changing Variable Type (numeric vs. factor) See Dummy Coding
Type:
y <- as.factor(y)    #changes the variable to factor
y <- as.numeric(y)   #changes the variable to numeric
Note 1: This function can be used to change a categorical variable into a numeric variable (useful for dummy coding)

Or recode as 0,1

Note 2: If you've renamed the variables in your data set, you must use the as.numeric function with the original data set terms (data.set$variable.name) to make the variable in the actual data set numeric.
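The factor-to-numeric pitfall described above can be sketched as follows: as.numeric() on a factor returns the internal level codes, not the printed labels, so convert through as.character() first:

```r
f <- factor(c("10", "20", "20", "30"))
as.numeric(f)                #WRONG: internal level codes
#[1] 1 2 2 3
as.numeric(as.character(f))  #RIGHT: the labels as numbers
#[1] 10 20 20 30
as.numeric(levels(f))[f]     #same result, slightly faster on long vectors
```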


Dummy Coding a Factor method 1 dummy(dataframe)

User defined function requires library(ade4)

#EXAMPLE
dummy <- function(df) {
  require(ade4)
  ISFACT <- sapply(df, is.factor)
  FACTS <- acm.disjonctif(df[, ISFACT, drop = FALSE])
  NONFACTS <- df[, !ISFACT, drop = FALSE]
  data.frame(NONFACTS, FACTS)
}
df <- data.frame(eggs = c("foo", "foo", "bar", "bar"),
                 ham  = c("red","blue","green","red"),
                 x    = rnorm(4))
dummy(df)

Dummy Coding a Factor method 2 model.matrix(~factor-1)


#EXAMPLE
x <- c(2, 2, 5, 3, 6, 5, NA)
xf <- factor(x, levels = 2:6)
model.matrix(~ xf - 1)


Convert numbers to Roman numerals as.roman(vector)


EXAMPLES
as.roman(101)               #converts to a roman numeral
as.roman(c(101,23,67,92))   #works on a vector
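Roman-numeral objects also support basic arithmetic and convert back with as.integer() (a small sketch):

```r
as.roman("MMXII") + as.roman(12)  #arithmetic works on roman objects
as.integer(as.roman("XIV"))       #back to an integer
#[1] 14
```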

Convert Decimals to fractions

library(MASS)

fractions(x, cycles = 10, max.denominator = 2000, ...) Arguments:

EXAMPLES library(MASS) fractions(.12) fractions(pi)

NOTE: This is Rational Approximation and may not be a true value of the decimal


Date & Time
Note: the package chron is good at handling dates and times.
Date                   date() or Sys.Date()
Time                   substr(as.character(Sys.time()), 12, 19)
Year                   substr(as.character(Sys.Date()), 1, 4)
Date/Time/Time Zone    Sys.time()
Extracting pieces from Sys.Date and Sys.time:
format(Sys.time(), "%a %b %d %H:%M:%S %Y")
%a=weekday; %b=month; %d=day of the month; %H:%M:%S =hour:minute:second; %Y=year

Use of cat with \n gets rid of the quotes around the date (see final example) cat(format(Sys.time(), "%a %b %d %H:%M:%S %Y"),"\n")
EXAMPLE format(Sys.Date(), format="%b %d %Y") format(Sys.Date(), format="%a %b %d %Y") format(Sys.time(), "%a %b %d %H:%M:%S %Y") format(Sys.time(),"%H:%M") dec1 <- as.Date("2004-12-1") cat(format(dec1, format="%b %d %Y"),"\n") #Notice how cat eliminates the quotes

Import dates in various formats such as dd/mm/yyyy as.Date(x, format = "")


Arguments

EXAMPLE dates <- c("02/27/92", "02/27/92", "01/14/92", "02/28/92", "02/01/92") as.Date(dates, "%m/%d/%y") dates <- c("02/27/1992", "02/27/1992", "01/14/1992", "02/28/1992", "02/01/1992") as.Date(dates, "%m/%d/%Y")


Differences in Dates and Times difftime(t1,t2) Note: put the later time in for t2 EXAMPLES

Units Argument difftime(time1, time2, tz, units = c("auto", "secs", "mins", "hours","days", "weeks")) Can request answer be given in "auto", "secs", "mins", "hours","days", "weeks"

difftime("2005-10-21","1980-11-16") as.numeric(difftime("2005-10-21","1980-11-16")) difftime("2011-05-17 00:35:07","2002-9-11 8:46:40") difftime(Sys.time(),"2002-9-11 8:46:40")

Output
> difftime("2005-10-21","1980-11-16") Time difference of 9104.958 days > as.numeric(difftime("2005-10-21","1980-11-16")) [1] 9104.958 > difftime("2011-05-17 00:35:07","2002-9-11 8:46:40") Time difference of 3169.659 days > difftime(Sys.time(),"2002-9-11 8:46:40") Time difference of 3169.661 days
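The units argument from the signature above, sketched (difftime chooses units automatically unless you fix them):

```r
d <- difftime(as.Date("2005-10-21"), as.Date("1980-11-16"), units = "weeks")
d               #time difference of ~1300.7 weeks
as.numeric(d)   #drop the units attribute for plain arithmetic
difftime(as.Date("2005-10-21"), as.Date("1980-11-16"), units = "days")
```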

Time and Date Sequence
seq.Date(from, to, by)   by = "day", "week", "month" or "year"
Turn Dates into Day of the Week
weekdays(x, abbreviate=FALSE)

EXAMPLE C<-seq.Date(as.Date("2010-10-10"),Sys.Date(),"week") data.frame("OBS"=1:length(C),C)

df <- data.frame(date=c("2012-02-01", "2012-02-01", "2012-02-02")) df$day <- weekdays(as.Date(df$date)) #turns dates to days of the week df$day.ab <- weekdays(as.Date(df$date), TRUE) > df date day day.ab 1 2012-02-01 Wednesday Wed 2 2012-02-01 Wednesday Wed 3 2012-02-02 Thursday Thu


Graphics Open a second Graphics Window (universal) dev.new() Open a second Graphics Window (windows) win.graph() or windows() or x11()

Open a second Graphics Window (mac) quartz() or x11()

Start a new, empty plot frame ready to draw on (no need to call a plot before adding lines etc.) plot.new() or frame()

This enables you to add lines and text without an actual plot.
Check the system for OS and return the correct graphics device (2 methods)
#covers everything and is safe for other Graphics Devices
if (dev.interactive()) dev.new()
#covers only GUI graphics devices and is not safe for other Graphics Devices
if (.Platform$GUI %in% c("X11", "Tk")) {
  X11()
} else {
  if (.Platform$GUI == "AQUA") {
    quartz()
  } else {
    windows()
  }
}
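Returning to plot.new(): a minimal sketch of drawing on the blank canvas it opens — set up user coordinates with plot.window() first, then add elements:

```r
plot.new()                                    #blank canvas, no data or axes
plot.window(xlim = c(0, 10), ylim = c(0, 10)) #establish user coordinates
text(5, 8, "Text without a plot() call")
lines(c(1, 9), c(2, 2), lwd = 2)
box()                                         #optional frame around the region
```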

Control the Size of the Graph Window windows(width=10, height=4) or win.graph(width=10, height=4) or x11(w=10, h=4) NOTE: All will take just w or h, or the specific order of w and then h, as in: x11(10,4) Pause Between Switching to Second Graph par(ask=TRUE) Multiple Graphs on one page par(mfrow=c(2,3)) 2 is in the rows position and 3 is in the columns position

Graphical output formats


Device Functions
Screen/GUI Devices: x11() or X11(); windows()
File Devices: postscript(file="myplot.ps"); pdf(file="myplot.pdf"); pictex(file="myplot.tex"); bmp(file="myplot.bmp"); jpeg(file="myplot.jpeg")

Return Current Graphic device dev.cur() Turn Off Graphic Device dev.off() Turns Off all graphics Devices graphics.off() Copy the Current Graphics Device to a File dev.copy(device=png, file="foo", width=500, height=300)


Par Function Arguments


adj  The value of adj determines the way in which text strings are justified in text, mtext and title. A value of 0 produces left-justified text, 0.5 (the default) centered text and 1 right-justified text. (Any value in [0, 1] is allowed, and on most devices values outside that interval will also work.) Note that the adj argument of text also allows adj = c(x, y) for different adjustment in x- and y- directions. Note that whereas for text it refers to positioning of text about a point, for mtext and title it controls placement within the plot or device region.
ann  If set to FALSE, high-level plotting functions calling plot.default do not annotate the plots they produce with axis titles and overall titles. The default is to do annotation.
ask  logical. If TRUE (and the R session is interactive) the user is asked for input before a new figure is drawn. As this applies to the device, it also affects output by packages grid and lattice. It can be set even on non-screen devices but may have no effect there. This is not really a graphics parameter, and its use is deprecated in favour of devAskNewPage.
bg  The color to be used for the background of the device region. When called from par() it also sets new=FALSE. See section Color Specification for suitable values. For many devices the initial value is set from the bg argument of the device, and for the rest it is normally "white". Note that some graphics functions such as plot.default and points have an argument of this name with a different meaning.
bty  A character string which determines the type of box drawn about plots. If bty is one of "o" (the default), "l", "7", "c", "u", or "]" the resulting box resembles the corresponding upper case letter. A value of "n" suppresses the box.
cex  A numerical value giving the amount by which plotting text and symbols should be magnified relative to the default. This starts as 1 when a device is opened, and is reset when the layout is changed, e.g. by setting mfrow.
Note that some graphics functions such as plot.default have an argument of this name which multiplies this graphical parameter, and some functions such as points accept a vector of values which are recycled. Other uses will take just the first value if a vector of length greater than one is supplied.
cex.axis  The magnification to be used for axis annotation relative to the current setting of cex.
cex.lab  The magnification to be used for x and y labels relative to the current setting of cex.
cex.main  The magnification to be used for main titles relative to the current setting of cex.
cex.sub  The magnification to be used for sub-titles relative to the current setting of cex.
cin  R.O.; character size (width, height) in inches. These are the same measurements as cra, expressed in different units.
col  A specification for the default plotting color. See section Color Specification. (Some functions such as lines accept a vector of values which are recycled. Other uses will take just the first value if a vector of length greater than one is supplied.)
col.axis  The color to be used for axis annotation. Defaults to "black".
col.lab  The color to be used for x and y labels. Defaults to "black".
col.main  The color to be used for plot main titles. Defaults to "black".
col.sub  The color to be used for plot sub-titles. Defaults to "black".
cra  R.O.; size of default character (width, height) in rasters (pixels). Some devices have no concept of pixels and so assume an arbitrary pixel size, usually 1/72 inch. These are the same measurements as cin, expressed in different units.
crt  A numerical value specifying (in degrees) how single characters should be rotated. It is unwise to expect values other than multiples of 90 to work. Compare with srt, which does string rotation.
csi  R.O.; height of (default-sized) characters in inches. The same as par("cin")[2].
cxy  R.O.; size of default character (width, height) in user coordinate units.
par("cxy") is par("cin")/par("pin") scaled to user coordinates. Note that c(strwidth(ch), strheight(ch)) for a given string ch is usually much more precise. din R.O.; the device dimensions, (width,height), in inches. err (Unimplemented; R is silent when points outside the plot region are not plotted.) The degree of error reporting desired.
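A minimal sketch of several of these parameters in use (assuming only base graphics and the built-in mtcars data; par() returns the old values so they can be restored):

```r
# Save the old settings while changing a few at once
old <- par(bty = "l",   # L-shaped box instead of the default "o"
           cex = 1.2,   # magnify text and symbols by 20%
           adj = 0)     # left-justify titles and labels
plot(mpg ~ disp, data = mtcars, main = "Left-justified title")
par(old)                # restore the previous settings
```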


family
The name of a font family for drawing text. The maximum allowed length is 200 bytes. This name gets mapped by each graphics device to a device-specific font description. The default value is "" which means that the default device fonts will be used (and what those are should be listed on the help page for the device). Standard values are "serif", "sans" and "mono", and the Hershey font families are also available. (Different devices may define others, and some devices will ignore this setting completely.) This can be specified inline for text.

fg
The color to be used for the foreground of plots. This is the default color used for things like axes and boxes around plots. When called from par() this also sets parameter col to the same value. See section Color Specification. A few devices have an argument to set the initial value, which is otherwise "black".

fig
A numerical vector of the form c(x1, x2, y1, y2) which gives the (NDC) coordinates of the figure region in the display region of the device. If you set this, unlike S, you start a new plot, so to add to an existing plot use new=TRUE as well.

fin
The figure region dimensions, (width, height), in inches. If you set this, unlike S, you start a new plot.

font
An integer which specifies which font to use for text. If possible, device drivers arrange so that 1 corresponds to plain text (the default), 2 to bold face, 3 to italic and 4 to bold italic. Also, font 5 is expected to be the symbol font, in Adobe symbol encoding. On some devices font families can be selected by family to choose different sets of 5 fonts.

font.axis
The font to be used for axis annotation.

font.lab
The font to be used for x and y labels.

font.main
The font to be used for plot main titles.

font.sub
The font to be used for plot sub-titles.

lab
A numerical vector of the form c(x, y, len) which modifies the default way that axes are annotated. The values of x and y give the (approximate) number of tickmarks on the x and y axes and len specifies the label length. The default is c(5, 5, 7). Note that this only affects the way the parameters xaxp and yaxp are set when the user coordinate system is set up, and is not consulted when axes are drawn. len is unimplemented in R.

las
numeric in {0,1,2,3}; the style of axis labels. 0: always parallel to the axis [default], 1: always horizontal, 2: always perpendicular to the axis, 3: always vertical. Also supported by mtext. Note that string/character rotation via argument srt to par does not affect the axis labels.

lend
The line end style. This can be specified as an integer or string: 0 and "round" mean rounded line caps [default]; 1 and "butt" mean butt line caps; 2 and "square" mean square line caps.

lheight
The line height multiplier. The height of a line of text (used to vertically space multi-line text) is found by multiplying the character height both by the current character expansion and by the line height multiplier. Default value is 1. Used in text and strheight.

ljoin
The line join style. This can be specified as an integer or string: 0 and "round" mean rounded line joins [default]; 1 and "mitre" mean mitred line joins; 2 and "bevel" mean bevelled line joins.

lmitre
The line mitre limit. This controls when mitred line joins are automatically converted into bevelled line joins. The value must be larger than 1 and the default is 10. Not all devices will honour this setting.

lty
The line type. Line types can either be specified as an integer (0=blank, 1=solid (default), 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash) or as one of the character strings "blank", "solid", "dashed", "dotted", "dotdash", "longdash", or "twodash", where "blank" uses invisible lines (i.e., does not draw them). Alternatively, a string of up to 8 characters (from c(1:9, "A":"F")) may be given, giving the length of line segments which are alternately drawn and skipped. See section Line Type Specification. Some functions such as lines accept a vector of values which are recycled. Other uses will take just the first value if a vector of length greater than one is supplied.

lwd
The line width, a positive number, defaulting to 1. The interpretation is device-specific, and some devices do not implement line widths less than one. (See the help on the device for details of the interpretation.) Some functions such as lines accept a vector of values which are recycled. Other uses will take just the first value if a vector of length greater than one is supplied.

mai
A numerical vector of the form c(bottom, left, top, right) which gives the margin size specified in inches.

mar
A numerical vector of the form c(bottom, left, top, right) which gives the number of lines of margin to be specified on the four sides of the plot. The default is c(5, 4, 4, 2) + 0.1.

mex
mex is a character size expansion factor which is used to describe coordinates in the margins of plots. Note that this does not change the font size, rather specifies the size of font (as a multiple of csi) used to convert between mar and mai, and between oma and omi. This starts as 1 when the device is opened, and is reset when the layout is changed (alongside resetting cex).

mfcol, mfrow
A vector of the form c(nr, nc). Subsequent figures will be drawn in an nr-by-nc array on the device by columns (mfcol), or rows (mfrow), respectively. In a layout with exactly two rows and columns the base value of "cex" is reduced by a factor of 0.83: if there are three or more of either rows or columns, the reduction factor is 0.66. Setting a layout resets the base value of cex and that of mex to 1. If either of these is queried it will give the current layout, so querying cannot tell you the order in which the array will be filled. Consider the alternatives, layout and split.screen.

mfg
A numerical vector of the form c(i, j) where i and j indicate which figure in an array of figures is to be drawn next (if setting) or is being drawn (if enquiring). The array must already have been set by mfcol or mfrow. For compatibility with S, the form c(i, j, nr, nc) is also accepted, when nr and nc should be the current number of rows and number of columns. Mismatches will be ignored, with a warning.

mgp
The margin line (in mex units) for the axis title, axis labels and axis line. Note that mgp[1] affects title whereas mgp[2:3] affect axis. The default is c(3, 1, 0).

mkh
The height in inches of symbols to be drawn when the value of pch is an integer. Completely ignored in R.

new
logical, defaulting to FALSE. If set to TRUE, the next high-level plotting command (actually plot.new) should not clean the frame before drawing as if it were on a new device. It is an error (ignored with a warning) to try to use new = TRUE on a device that does not currently contain a high-level plot.

oma
A vector of the form c(bottom, left, top, right) giving the size of the outer margins in lines of text.

omd
A vector of the form c(x1, x2, y1, y2) giving the region inside outer margins in NDC (= normalized device coordinates), i.e., as a fraction (in [0, 1]) of the device region.

omi
A vector of the form c(bottom, left, top, right) giving the size of the outer margins in inches.

pch
Either an integer specifying a symbol or a single character to be used as the default in plotting points. See points for possible values and their interpretation. Note that only integers and single-character strings can be set as a graphics parameter (and not NA nor NULL).

pin
The current plot dimensions, (width, height), in inches.

plt
A vector of the form c(x1, x2, y1, y2) giving the coordinates of the plot region as fractions of the current figure region.

ps
integer; the point size of text (but not symbols). Unlike the pointsize argument of most devices, this does not change the relationship between mar and mai (nor oma and omi). What is meant by point size is device-specific, but most devices mean a multiple of 1bp, that is 1/72 of an inch.

pty
A character specifying the type of plot region to be used; "s" generates a square plotting region and "m" generates the maximal plotting region.
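A small sketch of how mar, mgp and mfrow combine (assuming only base graphics and the built-in mtcars data):

```r
old <- par(mfrow = c(1, 2),            # two figures side by side
           mar = c(4, 4, 2, 1) + 0.1,  # tighter than the default c(5,4,4,2)+0.1
           mgp = c(2, 0.7, 0))         # axis title and labels closer to the axis
plot(mpg ~ wt,   data = mtcars, main = "mpg vs wt")
plot(mpg ~ disp, data = mtcars, main = "mpg vs disp")
par(old)
```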


smo
(Unimplemented) a value which indicates how smooth circles and circular arcs should be.

srt
The string rotation in degrees. See the comment about crt. Only supported by text.

tck
The length of tick marks as a fraction of the smaller of the width or height of the plotting region. If tck >= 0.5 it is interpreted as a fraction of the relevant side, so if tck = 1 grid lines are drawn. The default setting (tck = NA) is to use tcl = -0.5.

tcl
The length of tick marks as a fraction of the height of a line of text. The default value is -0.5; setting tcl = NA sets tck = -0.01 which is S' default.

usr
A vector of the form c(x1, x2, y1, y2) giving the extremes of the user coordinates of the plotting region. When a logarithmic scale is in use (i.e., par("xlog") is true, see below), then the x-limits will be 10 ^ par("usr")[1:2]. Similarly for the y-axis.

xaxp
A vector of the form c(x1, x2, n) giving the coordinates of the extreme tick marks and the number of intervals between tick-marks when par("xlog") is false. Otherwise, when log coordinates are active, the three values have a different meaning: For a small range, n is negative, and the ticks are as in the linear case, otherwise, n is in 1:3, specifying a case number, and x1 and x2 are the lowest and highest power of 10 inside the user coordinates, 10 ^ par("usr")[1:2]. (The "usr" coordinates are log10-transformed here!) n=1 will produce tick marks at 10^j for integer j, n=2 gives marks k 10^j with k in {1,5}, n=3 gives marks k 10^j with k in {1,2,5}. See axTicks() for a pure R implementation of this. This parameter is reset when a user coordinate system is set up, for example by starting a new page or by calling plot.window or setting par("usr"): n is taken from par("lab"). It affects the default behaviour of subsequent calls to axis for sides 1 or 3.

xaxs
The style of axis interval calculation to be used for the x-axis. Possible values are "r", "i", "e", "s", "d". The styles are generally controlled by the range of data or xlim, if given. Style "r" (regular) first extends the data range by 4 percent at each end and then finds an axis with pretty labels that fits within the extended range. Style "i" (internal) just finds an axis with pretty labels that fits within the original data range. Style "s" (standard) finds an axis with pretty labels within which the original data range fits. Style "e" (extended) is like style "s", except that it also ensures that there is room for plotting symbols within the bounding box. Style "d" (direct) specifies that the current axis should be used on subsequent plots. (Only "r" and "i" styles have been implemented in R.)

xaxt
A character which specifies the x axis type. Specifying "n" suppresses plotting of the axis. The standard value is "s": for compatibility with S values "l" and "t" are accepted but are equivalent to "s": any value other than "n" implies plotting.

xlog
A logical value (see log in plot.default). If TRUE, a logarithmic scale is in use (e.g., after plot(*, log = "x")). For a new device, it defaults to FALSE, i.e., linear scale.

xpd
A logical value or NA. If FALSE, all plotting is clipped to the plot region, if TRUE, all plotting is clipped to the figure region, and if NA, all plotting is clipped to the device region. See also clip.

yaxp
A vector of the form c(y1, y2, n) giving the coordinates of the extreme tick marks and the number of intervals between tick-marks unless for log coordinates, see xaxp above.

yaxs
The style of axis interval calculation to be used for the y-axis. See xaxs above.

yaxt
A character which specifies the y axis type. Specifying "n" suppresses plotting.

ylbias
A positive real value used in the positioning of text in the margins by axis and mtext. The default is in principle device-specific, but currently 0.2 for all of R's own devices. Set this to 0.2 for compatibility with R < 2.14.0 on x11 and windows() devices.

ylog
A logical value; see xlog above.
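The usr, xaxs and xlog entries can be checked interactively (a sketch; the approximate usr numbers come from the 4 percent extension that style "r" applies):

```r
plot(1:10, 1:10)             # default xaxs = "r" extends the range by 4% each end
par("usr")                   # c(x1, x2, y1, y2), here about c(0.64, 10.36, ...)
plot(1:10, 1:10, xaxs = "i") # style "i" fits the axis to the data exactly
par("usr")[1:2]              # exactly c(1, 10)
plot(1:10, log = "x")        # logarithmic x-scale
par("xlog")                  # TRUE; x-limits are 10 ^ par("usr")[1:2]
```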


COLORS
List all the available graphics colors:
colors()
(chart: List of Colors in [R])

Hexadecimal Color Chart


Color Palette
palette()
The palette is what is supplied to col arguments referenced by number. The palette can be changed to any of the colors above, e.g. palette(colors()[subset of numbers from chart]), and then reset using the "default" argument.
Default colors: black, red, green3, blue, cyan, magenta, yellow, gray
palette()  # obtain the current palette
palette(rainbow(6))  # six color rainbow
palette()  # obtain the current palette
palette(colors()[c(1,10,20,30,40,50,60,70,80,90,100)])  # 11 colors
palette()  # obtain the current palette
palette("default")  # reset the color palette
palette()  # obtain the current palette

Changing Colors in Arguments


Example
frame()
textClick("GGG", colors()[47], 4)
textClick("GGG", colors()[134], 4)
textClick("GGG", colors()[500], 4)
textClick("GGG", colors()[551], 4)
textClick("GGG", colors()[634], 4)

Compare the numbers to the number chart above.


Show Some of [R]'s colors by name and color
library(DAAG)
show.colors(type=c("shades"), order.cols=TRUE)
show.colors(type=c("gray"), order.cols=TRUE)
show.colors(type=c("singles"), order.cols=TRUE)

EXAMPLE
plot(mpg~disp, col="blue4", data=mtcars)  # using shades

Preset Palettes
rainbow(n, s = 1, v = 1, start = 0, end = max(1,n - 1)/n, alpha = 1)
gray.colors(n, start = 0.3, end = 0.9, gamma = 2.2)
heat.colors(n, alpha = 1)
terrain.colors(n, alpha = 1)
topo.colors(n, alpha = 1)
cm.colors(n, alpha = 1)

Arguments
n: the number of colors (>= 1) to be in the palette.
s, v: the saturation and value to be used to complete the HSV color descriptions.
start: the (corrected) hue in [0,1] at which the rainbow begins.
end: the (corrected) hue in [0,1] at which the rainbow ends.
alpha: the alpha transparency, a number in [0,1]; see argument alpha in hsv.

EXAMPLE
frame()
terrain.colors(6)
textClick("GGG", terrain.colors(7)[1], 4)
textClick("GGG", terrain.colors(7)[2], 4)
textClick("GGG", terrain.colors(7)[3], 4)
textClick("GGG", terrain.colors(7)[4], 4)
textClick("GGG", terrain.colors(7)[5], 4)
textClick("GGG", terrain.colors(7)[6], 4)
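Each of these functions simply returns a vector of n hex color strings, so the palettes can be previewed without any helper functions (a minimal sketch using only base graphics):

```r
n <- 6
pals <- list(rainbow = rainbow(n),        heat = heat.colors(n),
             terrain = terrain.colors(n), topo = topo.colors(n),
             cm      = cm.colors(n),      gray = gray.colors(n))
op <- par(mfrow = c(length(pals), 1), mar = c(0.5, 6, 0.5, 1))
for (nm in names(pals)) {   # one strip of n colored bars per palette
  barplot(rep(1, n), col = pals[[nm]], axes = FALSE, border = NA, space = 0)
  mtext(nm, side = 2, las = 1)
}
par(op)
```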

Change the Background Color
par(bg="color")

Change the Foreground Color
par(fg="color")
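For example (a sketch; note that setting fg from par() also sets col, as described in the parameter list above):

```r
op <- par(bg = "lightyellow",  # device background
          fg = "navy")         # axes and boxes; also resets col to "navy"
plot(mpg ~ disp, data = mtcars, main = "Navy foreground on light yellow")
par(op)
```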

Select Random Color Self-created color randomization function ran.col(c(dataframe, vector, number), color.choice = c(colors, rainbow, heat, terrain, topo, cm))
x11(16,8)
par(mfrow = c(2,3))
with(mtcars, plot(mpg, disp, pch=19, col=ran.col(mtcars,colors), main="COLORS"))
with(mtcars, plot(mpg, disp, pch=19, col=ran.col(mtcars,rainbow), main="RAINBOW"))
with(mtcars, plot(mpg, disp, pch=19, col=ran.col(mtcars,heat), main="HEAT"))
with(mtcars, plot(mpg, disp, pch=19, col=ran.col(mtcars,terrain), main="TERRAIN"))
with(mtcars, plot(mpg, disp, pch=19, col=ran.col(mtcars,topo), main="TOPO"))
with(mtcars, plot(mpg, disp, pch=19, col=ran.col(3,cm), main="CM"))
ran.col(6,colors)
#USING TO SET PALETTE
palette()             #current palette
palette(ran.col(10))  #set palette
palette()             #current palette
with(mtcars, plot(mpg, disp, pch=19, col=cyl, main="COLORS"))
palette("default")    #return to default


Plot two graphs in the same pane (Overlay graphs) par(new=TRUE)


EXAMPLES

EXAMPLE A
attach(mtcars)
plot(mpg~as.factor(cyl), col="green")
par(new=TRUE)
plot(mpg~cyl, col="blue", xlab="", axes=F)
detach(mtcars)

EXAMPLE B
x11()
frame()
with(mtcars, plot(mpg~disp))
shapeClick("poly", 6, border="red", col="yellow")
shapeClick("poly", 6, border="red", col="green")
shapeClick("poly", 6, border="red", col="orange")
par(new=T)
with(mtcars, plot(mpg~disp))

Building Plot Frames from Pieces
plot(x, y, type="n", xlab="", ylab="", axes=F)
points(x, y)
axis(1)
axis(2, at=seq(.2,1.8,.2))
box()

EXAMPLE
attach(mtcars)
plot(mpg, disp, type="n", xlab="", ylab="", axes=F)
points(mpg, disp, col="blue")
axis(1, at=seq(0,35,5), col="red", col.axis="green", lwd=3)
axis(2, lwd=6)
axis(3, seq_along(mpg), c(LETTERS,LETTERS[1:6]), col.axis = "blue")
axis(4)
box(col="orange", lwd=7)
title(main="YEPPER", xlab="OK GUY", ylab="YOU DA MAN", sub="SUBTITLE")
detach(mtcars)

plot(1:10, xaxt = "n")
axis(1, xaxp=c(0, 9, 5))
plot(1:10, xaxt = "n")
axis(1, xaxp=c(2, 9, 7))

Plot Grid Lines grid(nx = NULL, ny = nx, col = "lightgray", lty = "dotted",lwd = par("lwd"), equilogs = TRUE)
EXAMPLE
frame()
grid(col="blue")
shapeClick("seg", col="red")


Title and Labels for Graphics
Type: plot(x, y, main="The Title", xlab="X Axis Label", ylab="Y Axis Label")
Where plot is the function, x is the x variable, y is the y variable, "The Title" is what the graph will be named, "X Axis Label" is the name of the x axis, and "Y Axis Label" is the name of the y axis. An example of plotting without and with the titles and labels.
Note: This can be applied to any graphic:
hist(p, main="Parent Aggression Levels", xlab="Aggression Range", ylab="Number of Occurrences")


Varying Graphs on a Page #1
layout(matrix(c(), rows, columns))
Work on the rows and columns first; the matrix specifications create the grid. So for a 2x2 grid, matrix(c(1,1,2,3), 2, 2, byrow = TRUE) will give graph 1 the first two boxes on the top row and graphs 2 and 3 one box each on the bottom. (See my created function for doing this quickly.)

#===================================================
# VARYING GRAPHS PER PAGE
#===================================================
attach(mtcars)
#===================================================
layout(matrix(c(1,1,2,3), 2, 2, byrow = TRUE))
hist(wt)
hist(mpg)
hist(disp)
#===================================================
windows()
layout(matrix(c(2,1,2,3), 2, 2, byrow = TRUE))
hist(wt)
hist(mpg)
hist(disp)
#===================================================
windows()
layout(matrix(c(1,2,3,3), 2, 2, byrow = TRUE))
hist(wt)
hist(mpg)
hist(disp)
#===================================================
windows(h=6,w=8)
layout(matrix(c(1,2,3,3,4,5), 3, 2, byrow = TRUE))
hist(wt)
hist(mpg)
hist(disp)
hist(drat)
hist(qsec)

Varying Graphs on a Page #2 (controls size of window and layout)


source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Graphics/Multiple Graphics Function.txt ")

multiG (width, height, columns, rows, matrix) For an example use: EXAMPLE(multiG)


Varying Graphs on a Page #3 library(plotrix)

Split the graphics device into a "panel" type layout for a group of plots

panes(mat=NULL,widths=rep(1,ncol(mat)),heights=rep(1,nrow(mat)), nrow=2,ncol=2, mar=c(0,0,1.6,0),oma=c(2.5,1,1,1)) Arguments:

EXAMPLE
y<-runif(8)
panes(matrix(1:4,nrow=2,byrow=TRUE))
par(mar=c(0,2,1.6,0))
boxplot(y,axes=FALSE)
axis(2)
box()
par(mar=c(0,0,1.6,2))
tab.title("Boxplot of y",tab.col="#88dd88")
barplot(y,axes=FALSE,col=2:9)
axis(4)
box()
tab.title("Barplot of y",tab.col="#88dd88")
par(mar=c(2,2,1.6,0))
pie(y,col=2:9)
tab.title("Pie chart of y",tab.col="#88dd88")
box()
par(mar=c(2,0,1.6,2))
plot(y,xaxs="i",xlim=c(0,9),axes=FALSE,col=2:9)
axis(4)
box()
tab.title("Scatterplot of y",tab.col="#88dd88")
# center the title at the left edge of the last plot
mtext("Test of panes function",at=0,side=1,line=0.8,cex=1.5)
panes(matrix(1:3,ncol=1),heights=c(0.7,0.8,1))
par(mar=c(0,2,2,2))
plot(sort(runif(7)),type="l",axes=FALSE)
axis(2,at=seq(0.1,0.9,by=0.2))
box()
tab.title("Rising expectations",tab.col="#ee6666")
barplot(rev(sort(runif(7))),col="blue",axes=FALSE)
axis(2,at=seq(0.1,0.9,by=0.2))
box()
tab.title("Diminishing returns",tab.col="#6666ee")
par(mar=c(4,2,2,2))
tso<-c(0.2,0.3,0.5,0.4,0.6,0.8,0.1)
plot(tso,type="n",axes=FALSE,xlab="")
# the following needs a Unicode locale to work
points(1:7,tso,pch=c(rep(-0x263a,6),-0x2639),cex=2)
axis(1,at=1:7, labels=c("Tuesday","Wednesday","Thursday","Friday",
    "Saturday","Sunday","Monday"))
axis(2,at=seq(0.1,0.9,by=0.2))
box()
tab.title("The sad outcome",tab.col="#66ee66")
mtext("A lot of malarkey",side=1,line=2.5)


Put a Box Around a Figure or a Group of Figures
box()


#EXAMPLES
test<-rnorm(100); plot(test)
box("figure", lwd=2)
test<-rnorm(100); plot(test)
box("outer", lwd=2)
par(mfrow = c(2, 2))
plot(test)
box("figure", lwd=1)
plot(test)
box("figure", lwd=1)
plot(test)
box("figure", lwd=1)
plot(test)
box("figure", lwd=1)
box("outer", lwd=5, col="red")


Graph Types
#===========================================================================
# THE DATA
#===========================================================================
slices <- c(11, 12, 4, 16, 8)
N <- sum(slices)
Percents <- format(digits=3, (slices/N)*100)
lbls <- c("US", "UK", "Australia", "Germany", "France")
lbls2 <- paste(lbls, " ", Percents, "%", sep="")
#===========================================================================
# PIE PLOTS (not a preferred method of display)
#===========================================================================
windows(height=6, width=10); par(mfrow=c(2,2))
#...........................................................................
# TYPE 1
#...........................................................................
pie(slices, labels = lbls, main="Pie Chart of Countries")
#...........................................................................
# TYPE 2
#...........................................................................
pie(slices, labels ="", main="Pie Chart of Countries",
    col=c("blue","red","green","yellow","orange"))
legend(1.60, .7, lbls, cex=0.8, fill=c("blue","red","green","yellow","orange"))
#...........................................................................
# TYPE 3
#...........................................................................
pie(slices, labels = lbls2, main="Pie Chart of Countries",
    col=c("blue","chocolate","red","yellow","bisque"))
#...........................................................................
# TYPE 4
#...........................................................................
pie(slices, labels=paste(Percents,"%",sep=""), main="Pie Chart of Countries",
    col=c("blue","red","green","yellow","orange"))
legend(1.60, .7, lbls, cex=0.8, fill=c("blue","red","green","yellow","orange"))

Pie Graph
pie(x, labels, ...)
x is a vector of values; labels is a vector of label names. See the examples to the right.
NOTE: Cleveland (1985) states that a pie chart is a poor choice for displaying information.

3-D Pie Graph
pie3D(x, labels, ...)
x is a vector of values; labels is a vector of label names. Very similar to pie. See the examples to the right.

#===========================================================================
# THE DATA
#===========================================================================
slices <- c(11, 12, 4, 16, 8)
N <- sum(slices)
Percents <- format(digits=3, (slices/N)*100)
lbls <- c("US", "UK", "Australia", "Germany", "France")
lbls2 <- paste(lbls, " ", Percents, "%", sep="")
#===========================================================================
# PIE PLOTS (not a preferred method of display)
#===========================================================================
library(plotrix); windows(h=6,w=12); par(mfrow=c(1,2))
#...........................................................................
# TYPE 1
#...........................................................................
pie3D(slices, labels = lbls, main="Pie Chart of Countries")
#...........................................................................
# TYPE 2
#...........................................................................
pie3D(slices, labels ="", main="Pie Chart of Countries",
      col=c("blue","red","green","yellow","orange"))
legend(.47, 1, lbls, cex=0.8, fill=c("blue","red","green","yellow","orange"))
windows(h=6,w=12); par(mfrow=c(1,2))
#...........................................................................
# TYPE 3
#...........................................................................
pie3D(slices, labels = lbls2, main="Pie Chart of Countries",
      col=c("blue","chocolate","red","yellow","bisque"), labelcex=1.1)
#...........................................................................
# TYPE 4
#...........................................................................
pie3D(slices, labels=paste(Percents,"%",sep=""), main="Pie Chart of Countries",
      col=c("blue","red","green","yellow","orange"))
legend(.55, 1.05, lbls, cex=0.8, fill=c("blue","red","green","yellow","orange"))


Dot Chart
dotchart(x, labels, ...)
x is a vector of values; labels is a vector of label names. Very similar to pie. See the examples to the right.
NOTE: The dot chart is preferred to the pie graph. It can display everything a pie graph can and then some.

#===========================================================================
# THE DATA
#===========================================================================
slices <- c(11, 12, 4, 16, 8)
N <- sum(slices)
Percents <- format(digits=3, (slices/N)*100)
lbls <- c("US", "UK", "Australia", "Germany", "France")
lbls2 <- paste(lbls, " ", Percents, "%", sep="")
#===========================================================================
# DOT PLOTS (preferred over pie charts)
#===========================================================================
windows(h=6,w=12); par(mfrow=c(1,2))
#...........................................................................
# Simple 1
#...........................................................................
dotchart(slices, labels=lbls2, cex=.7,
         main="Dot Plot Countries Comparison",
         xlab="Corn Production (Millions of Bushels)",
         col=c("blue","red","darkgreen","black","orange"))
#...........................................................................
# Simple 2 Colored
#...........................................................................
dotchart(mtcars$mpg, labels=row.names(mtcars), cex=.7,
         main="Gas Mileage for Car Models",
         xlab="Miles Per Gallon")
#...........................................................................
# By Group-Colored (Cylinders)
#...........................................................................
windows(h=6,w=6); par(mfrow=c(1,1))
x <- mtcars[order(mtcars$mpg),]  # sort by mpg
x$cyl <- factor(x$cyl)           # it must be a factor
x$color[x$cyl==4] <- "red"
x$color[x$cyl==6] <- "blue"
x$color[x$cyl==8] <- "darkgreen"
dotchart(x$mpg, labels=row.names(x), cex=.7, groups=x$cyl,
         main="Gas Mileage for Car Models\ngrouped by cylinder",
         xlab="Miles Per Gallon", gcolor="black", color=x$color)
mtext("Cars Grouped by Cylinder", side = 2, line = 2, cex = .7)

StripPlot

library(lattice)

stripplot(factor~numeric)
library(lattice)
stripplot(factor(cyl, levels=c("8","4","6")) ~ mpg, data=mtcars)
stripplot(factor(cyl, levels=c("8","6","4")) ~ mpg, main="Mileage by Cylinder Type", ylab="Cylinders", data=mtcars)


Venn Diagram 1
library(venneuler)

Example:

List1 <- c("apple", "apple", "orange", "kiwi", "cherry", "peach")
List2 <- c("apple", "orange", "cherry", "tomatoe", "pear", "plum", "plum")
Lists <- list(List1, List2)  #put the word vectors into a list to supply lapply
items <- sort(unique(unlist(Lists)))  #put in alphabetical order
MAT <- matrix(rep(0, length(items)*length(Lists)), ncol=2)  #make a matrix of 0's
colnames(MAT) <- paste0("List", 1:2)
rownames(MAT) <- items
lapply(seq_along(Lists), function(i) {  #fill the matrix
  MAT[items %in% Lists[[i]], i] <<- table(Lists[[i]])
})
MAT  #look at the results
library(venneuler)
v <- venneuler(MAT)
plot(v)

Venn Diagram 2

library(gplots)

venn(data, universe=NA, small=0.7, showSetLogicLabel=FALSE, simplify=FALSE, show.plot=TRUE)


Arguments
data, x: Either a list containing vectors of names or indices of group members, or a data frame containing boolean indicators of group membership.
universe: Subset of valid name/index elements. Values in data not in this list will be ignored. Use NA to use all elements of data (the default).
small: Character scaling of the smallest group counts.
simplify: Logical flag indicating whether unobserved groups should be omitted.
show.plot: Logical flag indicating whether the plot should be displayed. If false, simply returns the group count matrix.
showSetLogicLabel: Logical flag indicating whether the internal group label should be displayed.
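A minimal sketch with the list form of data (assumes the gplots package is installed; the list names give the circle labels):

```r
library(gplots)
A <- c("apple", "orange", "kiwi", "cherry")
B <- c("apple", "orange", "pear", "plum")
venn(list(A = A, B = B))                               # draw the 2-set diagram
counts <- venn(list(A = A, B = B), show.plot = FALSE)  # group count matrix only
counts
```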


Line Graph See Example Below


cars <- c(1, 3, 6, 4, 9)
trucks <- c(2, 5, 4, 5, 12)
# Calculate range from 0 to max value of cars and trucks
g_range <- range(0, cars, trucks)
plot(cars, type="o", col="blue", ylim=g_range, axes=FALSE, ann=FALSE)
# Make x axis using Mon-Fri labels
axis(1, at=1:5, lab=c("Mon","Tue","Wed","Thu","Fri"))
# Make y axis with horizontal labels that display ticks at
# every 4 marks. 4*0:g_range[2] is equivalent to c(0,4,8,12).
axis(2, las=1, at=4*0:g_range[2])
# Create box around plot
box()
# Graph trucks with red dashed line and square points
lines(trucks, type="o", pch=22, lty=2, col="red")
# Create a title with a red, bold/italic font
title(main="Autos", col.main="red", font.main=4)
# Label the x and y axes with dark green text
title(xlab="Days", col.lab=rgb(0,0.5,0))
title(ylab="Total", col.lab=rgb(0,0.5,0))
# Create a legend at (1, g_range[2]) that is slightly smaller
# (cex) and uses the same line colors and points used by
# the actual plots
legend(1, g_range[2], c("cars","trucks"), cex=0.8,
       col=c("blue","red"), pch=21:22, lty=1:2)

Line Graph With Confidence Interval lineplot.CI(x.factor=, response=, main=" ", data=,xlab="",ylab="") x.factor is grouping variable, response is the numeric measure
EXAMPLE
#oneway
lineplot.CI(x.factor=cyl, response=mpg, main="MPG by Cylinder Type",
            data=mtcars, xlab="Cylinders", ylab="mpg")
#twoway
lineplot.CI(x.factor=cyl, response=mpg, group=am, main="MPG by Cylinder Type",
            data=mtcars, xlab="Cylinders", ylab="mpg")


Line Graph 2 (more for joining points of existing plots). The lines() call in the example is the code responsible.
EXAMPLE
with(mtcars, plot(mpg, hp, main="Norah's Cries", xlab="Time", ylab="Decibels"))
sequence <- with(mtcars, order(mpg))
with(mtcars, lines(mpg[sequence], hp[sequence], col="green", lwd=2))
shapeClick("arrow", code=1, col="blue", lwd=2)
shapeClick("arrow", code=1, col="blue", lwd=2)
shapeClick("arrow", code=1, col="blue", lwd=2)
text(locator(1), "Beginning RTI!", pos=4)
text(locator(1), "It gets worse before", pos=4)
text(locator(1), "it gets better!", pos=4)
text(locator(1), "Extinction Bursts", pos=4)
shapeClick("box", border="blue", lwd=2)
shapeClick("box", border="blue", lwd=2)
shapeClick("box", border="blue", lwd=2)

Interaction Plot method 1

see also effects below

interaction.plot(x.factor, trace.factor, response, fun = mean, type = c("l", "p", "b"), legend = TRUE, trace.label = deparse(substitute(trace.factor)), fixed = FALSE, xlab = deparse(substitute(x.factor)), ylab = ylabel, ylim = range(cells, na.rm=TRUE), lty = nc:1, col = 1, pch = c(1:9, 0, letters), xpd = NULL, leg.bg = par("bg"), leg.bty = "n", xtick = FALSE, xaxt = par("xaxt"), axes = TRUE, ...)

Arguments:

EXAMPLE:
HW19<-read.table("HW19.csv", header=TRUE, sep=",",na.strings="NA");HW19$Attitude<-as.factor(HW19$Attitude);x11(18,8)

frame()
par(mfrow=c(1,3))
with(HW19, interaction.plot(Attitude, Gender, Science.Comprehension, lwd=3, col=c(11,4)))
with(HW19, interaction.plot(Attitude, Grade, Science.Comprehension, lwd=3, col=c(6,2)))
with(HW19, interaction.plot(Grade, Gender, Science.Comprehension, lwd=3, col=c("orange","purple")))


Interaction Plot method 2

library(effects)

plot(effect (term1:term2, fit, list(term3=c(levels))), multiline=TRUE)


EXAMPLE
HW19<-read.table("HW19.csv", header=TRUE, sep=",", na.strings="NA")
HW19$Attitude<-as.factor(HW19$Attitude)
fit <- with(HW19, lm(Science.Comprehension~Gender * Attitude * Grade))
attach(HW19)
plot(effect("Gender:Attitude", fit, list(Gender=c("f","m"))), multiline=T)
plot(effect("Attitude:Grade", fit, list(Grade=c("eight","nine"))), multiline=T) #had to x out of graph
plot(effect("Gender:Grade", fit, list(Grade=c("eight","nine"))), multiline=T)

Interaction Plot method 3
interaction2wt(y~x1*x2, data=)

library(HH)

interaction2wt(len~supp*dose, data=ToothGrowth)


Plot the columns of one matrix or dataframe against the columns of another
matplot(x, y, type = "p", lty = 1:5, lwd = 1, lend = par("lend"), pch = NULL, col = 1:6, cex = NULL, bg = NA, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL)
Arguments

type: what type of plot should be drawn. Possible types are

"p" for points, "l" for lines, "b" for both, "c" for the lines part alone of "b", "o" for both overplotted, "h" for histogram like (or high-density) vertical lines, "s" for stair steps, "S" for other steps, see Details below, "n" for no plotting.

Example
x <- as.matrix(EuStockMarkets[1:50, ])
matplot(x, main = "matplot (standard)", xlab = "", ylab = "")
matplot(x, type = "l", lty = 1, main = "matplot (line)", xlab = "", ylab = "")
#=============================================================
x <- as.data.frame(x)
matplot(x, main = "matplot (standard)", xlab = "", ylab = "")
matplot(x, type = "l", lty = 1, main = "matplot (line)", xlab = "", ylab = "")


Bar graph barplot(x)
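A minimal sketch of the basic call (the counts vector is made up for illustration); barplot() invisibly returns the bar midpoints, which is handy for labeling afterward:

```r
# Simple bar graph from a named vector of hypothetical counts
counts <- c(A = 3, B = 7, C = 5)
mids <- barplot(counts, col = "steelblue", main = "Counts by group")
# barplot() returns the bar midpoints; use them to label each bar
text(mids, counts, labels = counts, pos = 1)
```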

sbar graph sbarplot
?barplot for details; horiz=TRUE yields horizontal bars

Bar graph with Confidence Intervals

library(sciplot)
bargraph.CI(x.factor=, response=, main=" ", data=, xlab="", ylab="")
x.factor is the grouping variable, response is the numeric measure
EXAMPLE
bargraph.CI(x.factor=cyl, response=mpg, main="MPG by Cylinder Type",
 data=mtcars, xlab="Cylinders", ylab="mpg")

3-D Bar Plot


source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Graphics/3-D Bar plot.txt")

Additional Niceties for Bar Graphs


barplot(VADeaths, beside=TRUE, las=1)
abline(h=seq(0, 100, by=1), col="gray90")
abline(h=seq(0, 100, by=10), col="gray")
par(new=TRUE)
barplot(VADeaths, beside=TRUE, las=1)

barplot(VADeaths, beside=TRUE, las=1)
abline(h=seq(0, 100, by=5), col="gray90")
abline(h=seq(0, 100, by=10), col="gray")
par(new=TRUE)
barplot(VADeaths, beside=TRUE, las=1)

barplot(VADeaths, beside=TRUE, las=1)
abline(h=0:100, col="white")
barplot(VADeaths, beside=TRUE, las=1, add=TRUE, col=FALSE)


Add Text Directly Below Or Above Bars

mtcars2 <- mtcars[order(-mtcars$mpg), ]
par(cex.lab=1, cex.axis=.6, mar=c(6.5, 3, 2, 2) + 0.1, xpd=NA)  #shrink axis text and increase bottom margin
barX <- barplot(mtcars2$mpg, xlab="Cars", main="MPG of Cars", ylab="",
 names.arg=rownames(mtcars2), mgp=c(5,1,0), ylim=c(0, 35), las=2, col=mtcars2$cyl)
mtext(side=2, text="MPG", cex=1, padj=-2.5)
text(cex=.5, x=barX, y=mtcars2$mpg+par("cxy")[2]/2, mtcars2$hp, xpd=TRUE)
text(cex=.5, x=barX, y=-.5, mtcars2$gear, xpd=TRUE, col="red")

day <- 0:28
ndied <- c(342,335,240,122,74,64,49,60,51,44,35,48,41,34,38,
 27,29,23,20,15,20,16,17,17,14,10,4,1,2)
pdied <- c(19.1,18.7,13.4,6.8,4.1,3.6,2.7,3.3,2.8,2.5,2.0,2.7,
 2.3,1.9,2.1,1.5,1.6,1.3,1.1,0.8,1.1,0.9,0.9,0.9,
 0.8,0.6,0.2,0.1,0.1)
pmort <- data.frame(day, ndied, pdied)
barX <- barplot(pmort$pdied, xlab="Age (days)", ylab="Percent",
 names.arg=pmort$day, xlim=c(0,35), ylim=c(0,20), legend="Mortality")
text(cex=.5, x=barX, y=pmort$pdied+par("cxy")[2]/2, pmort$ndied, xpd=TRUE, col="darkgreen")
text(cex=.5, x=barX, y=-.5, pmort$ndied, xpd=TRUE, col="blue")

X2sum <- c(42.6, 3.6, 1.8, 3.9, 12.1, 14.3, 14.6, 28.4)
X2.labels <- c("No earnings", "Less than $5000/year", "$5K to $10K",
 "$10K to $15K", "$15K to $20K", "$20K to $25K", "$25K to $30K", "Over $30K")
barCenters <- barplot(X2sum)
text(barCenters, par("usr")[3] - 0.5, srt=45, adj=1, labels=X2.labels, xpd=TRUE, cex=.7)


Histogram

shistogram

hist(Set$Attitude, col="purple", breaks=20)
hist(Set$Attitude, col="purple")

Histogram with kernel density and normal curve
histkdnc(variable) library(descr)

Additional arguments are similar to histogram

Histograms for all variables in a data frame or matrix hist.data.frame(data.frame) library(Hmisc)

Histograms with normal curve and density plot for all variables in a data frame or matrix multi.hist(dataframe) library(psych)
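If neither package is at hand, the same "histogram for every variable" idea can be sketched in base R with a loop over the numeric columns (mtcars used purely for illustration):

```r
# Histogram of every numeric column in a data frame (base-R sketch)
df <- mtcars
num_cols <- names(df)[sapply(df, is.numeric)]
# lay out a grid with 4 plots per row
par(mfrow = c(ceiling(length(num_cols) / 4), 4))
for (v in num_cols) hist(df[[v]], main = v, xlab = v, col = "gray")
```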


Density Plot
plot(density(d1$mathscore), main="yes", xlab="bad", ylab="good")  #The Plot
polygon(density(d1$mathscore), col="orange", border="purple")  #Coloring and the Border

Histogram-density plot with qq plot (check normality)


source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Graphics/ Histogram-density plot with qq plot.txt")

QQhist(x)   for an example use: QQhist.fun()

Plot 2 or more density plots
library(sm)
sm.density.compare(num.variable, factor)
*GRAPH IS EMBELLISHED WITH MEAN LINES FOR EACH GROUP AND EXTRA MEAN LINES TO EXPLAIN THE 4-CYL BIMODAL GRAPH

EXAMPLE
library(sm)
library(doBy)  #provides recodeVar()
mtcars2 <- mtcars
mtcars2$cyl <- as.factor(with(mtcars, recodeVar(cyl, src=c(4,6,8),
 tgt=c("four","six","eight"), default=NULL, keep.na=TRUE)))
fm <- mean(subset(mtcars2, cyl=="four")$mpg)
sm <- mean(subset(mtcars2, cyl=="six")$mpg)
em <- mean(subset(mtcars2, cyl=="eight")$mpg)
with(mtcars2, sm.density.compare(mpg, cyl))  #plot several densities at once
abline(v=fm)
abline(v=sm)
abline(v=em)
# uh oh 4 cyl is bi-modal. Why? #plot means

#plot of means for four cylinder by displacement;
#the factor that makes this group's graph bi-modal
abline(v=mean(fmDF[c(1,2,3,7,9,11),1]), col="orange")
abline(v=mean(fmDF[c(4,5,6,8,10),1]), col="pink")
legend(locator(1), c("Four Cylinder","Six Cylinder","Eight Cylinder",
 "4cyl low disp","4cyl high disp"),
 fill=c("green","blue","red","orange","pink"))


Histogram with colored tails (2 sd or whatever you set)
histogram <- hist(scale(vector), breaks= , plot=FALSE)
plot(histogram, col=ifelse(abs(histogram$breaks) < #of SD, Color 1, Color 2))

Example
windows(13,4)
par(mfrow=c(1,2))
histograph <- hist(scale(mtcars$mpg), breaks=10, plot=FALSE)
plot(histograph, main="Histogram of MPG", col=ifelse(abs(histograph$breaks) < 2, 5, 8))
x <- rnorm(1000)
hx <- hist(x, breaks=150, plot=FALSE)
plot(hx, col=ifelse(abs(hx$breaks) < 2, 3, 6))

Stem and Leaf Plot
stem(x, scale = 1, width = 80, atom = 1e-08)
Arguments
x       a numeric vector.
scale   This controls the plot length.
width   The desired width of plot.
atom    a tolerance.
EXAMPLE:
x <- round(mtcars$mpg)
stem(x, scale=.5)
stem(x, scale=.25)


MULTIVARIATE DATA PLOTS

Star Plot (Multivariate data)
Draw star plots or segment diagrams of a multivariate data set
stars(x, full = TRUE, scale = TRUE, radius = TRUE, labels = dimnames(x)[[1]], locations = NULL, nrow = NULL, ncol = NULL, len = 1, key.loc = NULL, key.labels = dimnames(x)[[2]], key.xpd = TRUE, xlim = NULL, ylim = NULL, flip.labels = NULL, draw.segments = FALSE, col.segments = 1:n.seg, col.stars = NA, axes = FALSE, frame.plot = axes, main = NULL, sub = NULL, xlab = "", ylab = "", cex = 0.8, lwd = 0.25, lty = par("lty"), xpd = FALSE, mar = pmin(par("mar"), 1.1+ c(2*axes+ (xlab != ""), 2*axes+ (ylab != ""), 1,0)), add = FALSE, plot = TRUE, ...)
ARGUMENTS

example(stars)


Chernoff Faces (Multivariate data)
faces(data, plot.faces=c(TRUE,FALSE))   library(aplpack)
faces2(data, ncol=#, nrow=#)            library(TeachingDemos)
EXAMPLES:
library(aplpack)
a <- aplpack::faces(mtcars, plot.faces=FALSE)
win.graph(11,8); par(mar=rep(0, 4), xpd=NA)
plot(0:5, 0:5, type="n")
plot(a)
library(TeachingDemos)
win.graph(11,8); par(mar=rep(0, 4), xpd=NA)
faces2(mtcars[,1:7], ncol=8, nrow=4)


Bubble Plot (view multivariate data)   bubbleplot

plot(y~x1)
symbols(y~x1, circles=x2)

EXAMPLE
plot(mpg ~ disp, data = mtcars, pch = "+", col = carb)
par(new=TRUE)
plot(mpg ~ disp, data = mtcars, pch = "0", col = cyl)
with(mtcars, symbols(disp, mpg, circles = hp, add = TRUE))
legend(280, 34, c("Four","Six","Eight"), fill=c("blue","violet","gray"), title="Cylinders")
legend(385, 34, c("One","Two","Three","Four","Six","Eight"),
 fill=palette()[as.numeric(levels(as.factor(mtcars$carb)))], title="Carburetors")
textClick(expression("Bubbles represent\nHorsepower"), "black", 1)
shapeClick("box", 3)
#data represents 5 different variables
#(figure: bubble plot of mpg vs. disp, bubbles sized by horsepower)

3-D Scatterplot (view multivariate data) library(scatterplot3d)

[see also spinnable 3-D scatterplot]

scatterplot3d(x, y=NULL, z=NULL, color=par("col"), pch=NULL, main=NULL, sub=NULL, xlim=NULL, ylim=NULL, zlim=NULL, xlab=NULL, ylab=NULL, zlab=NULL, scale.y=1, angle=40,axis=TRUE, tick.marks=TRUE, label.tick.marks=TRUE, x.ticklabs=NULL, y.ticklabs=NULL, z.ticklabs=NULL, y.margin.add=0, grid=TRUE, box=TRUE, lab=par("lab"), lab.z=mean(lab[1:2]), type="p", highlight.3d=FALSE, mar=c(5,3,4,3)+0.1, col.axis=par("col.axis"), col.grid="grey", col.lab=par("col.lab"), cex.symbols=par("cex"), cex.axis=0.8 * par("cex.axis"), cex.lab=par("cex.lab"), font.axis=par("font.axis"),font.lab=par("font.lab"), lty.axis=par("lty"), lty.grid=par("lty"), lty.hide=NULL, lty.hplot=par("lty"), log="")
x                  the coordinates of points in the plot.
y                  the y coordinates of points in the plot, optional if x is an appropriate structure.
z                  the z coordinates of points in the plot, optional if x is an appropriate structure.
color              colors of points in the plot, optional if x is an appropriate structure. Will be ignored if highlight.3d = TRUE.
pch                plotting "character", i.e. symbol to use.
main               an overall title for the plot.
sub                sub-title.
xlim, ylim, zlim   the x, y and z limits (min, max) of the plot. Note that setting enlarged limits may not work as exactly as expected (a known but unfixed bug).
xlab, ylab, zlab   titles for the x, y and z axis.
scale.y            scale of y axis related to x- and z axis.
angle              angle between x and y axis (Attention: result depends on scaling).
axis               a logical value indicating whether axes should be drawn on the plot.
tick.marks         a logical value indicating whether tick marks should be drawn on the plot (only if axis = TRUE).
label.tick.marks   a logical value indicating whether tick marks should be labeled on the plot (only if axis = TRUE and tick.marks = TRUE).
x.ticklabs, y.ticklabs, z.ticklabs   vector of tick mark labels.
y.margin.add       add additional space between tick mark labels and axis label of the y axis.
grid               a logical value indicating whether a grid should be drawn on the plot.
box                a logical value indicating whether a box should be drawn around the plot.
lab                a numerical vector of the form c(x, y, len). The values of x and y give the (approximate) number of tickmarks on the x and y axes.
lab.z              the same as lab, but for z axis.
type               character indicating the type of plot: "p" for points, "l" for lines, "h" for vertical lines to x-y-plane, etc.
highlight.3d       points will be drawn in different colors related to y coordinates (only if type = "p" or type = "h", else color will be used). On some devices not all colors can be displayed. In this case try the postscript device or use highlight.3d = FALSE.
mar                a numerical vector of the form c(bottom, left, top, right) which gives the lines of margin to be specified on the four sides of the plot.
col.axis, col.grid, col.lab      the color to be used for axis / grid / axis labels.
cex.symbols, cex.axis, cex.lab   the magnification to be used for point symbols, axis annotation, labels relative to the current.
font.axis, font.lab   the font to be used for axis annotation / labels.
lty.axis, lty.grid    the line type to be used for axis / grid.
lty.hide           line style used to plot non-visible edges (defaults to the lty.axis style).
lty.hplot          the line type to be used for vertical segments with type = "h".
log                Not yet implemented! A character string which contains "x" (if the x axis is to be logarithmic), "y", "z", "xy", "xz", "yz", "xyz".

EXAMPLE
library(scatterplot3d)
multiG(16,8,2,1)
with(mtcars, scatterplot3d(mpg, disp, hp, color=cyl, pch=19, main="HP, MPG, & DISP BY CYL"))
par(mar=rep(0, 4), xpd=NA)
legend(locator(1), legend=c("4 cyl","6 cyl","8 cyl"), fill=c(4,6,8), title="Cylinders")
with(mtcars, scatterplot3d(mpg, disp, hp, color=cyl, pch=gear, main="HP, MPG, & DISP BY CYL & GEAR"))
par(mar=rep(0, 4), xpd=NA)
legend(locator(1), legend=c("3","4","5"), pch=c(3,4,5), title="Gear Number")
#by changing pch and color we're viewing 5 variables simultaneously


Spinnable 3-D Scatterplot   rotate
Method 1: plot3d(x, y, z, xlab, ylab, zlab, type = "p", col, size, lwd, radius)
Method 2: scatter3d(x, y, z, xlab=deparse(substitute(x)), ylab=deparse(substitute(y)), zlab=deparse(substitute(z)), axis.scales=TRUE, revolutions=0, bg.col=c("white", "black"), axis.col=if (bg.col == "white") c("darkmagenta", "black", "darkcyan") else c("darkmagenta", "white", "darkcyan"), surface.col=c("blue", "green", "orange", "magenta", "cyan", "red", "yellow", "gray"), surface.alpha=0.5, neg.res.col="red", pos.res.col="green", square.col=if (bg.col == "white") "black" else "gray", point.col="yellow", text.col=axis.col, grid.col=if (bg.col == "white") "black" else "gray", fogtype=c("exp2", "linear", "exp", "none"), residuals=(length(fit) == 1), surface=TRUE, fill=TRUE, grid=TRUE, grid.lines=26, df.smooth=NULL, df.additive=NULL, sphere.size=1, threshold=0.01, speed=1, fov=60, fit="linear", groups=NULL, parallel=TRUE, ellipsoid=FALSE, level=0.5, ellipsoid.alpha=0.1, id.method=c("mahal", "xz", "y", "xyz", "identify", "none"), id.n=if (id.method == "identify") Inf else 0, labels=as.character(seq(along=x)), offset = ((100/length(x))^(1/3)) * 0.02, model.summary=FALSE)

EXAMPLES
library(rgl)
with(mtcars, plot3d(wt, disp, mpg, col=cyl, size=6))
library(Rcmdr)
with(mtcars, scatter3d(wt, disp, mpg, col=cyl))


Staircase plot (show an increase or a decrease over time) library(plotrix) See examples for best understanding:
EXAMPLE
sample_size<-c(500,-72,428,-94,334,-45,-89,200)
totals<-c(TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,FALSE,TRUE)
labels<-c("Contact list","Uncontactable","","Declined","","Ineligible",
 "Died","Final sample")
#==========================================================================
staircase.plot(sample_size,totals,labels,main="Acquisition of the sample",
 total.col="gray",inc.col=2:5,bg.col="#eeeebb",direction="s")
staircase.plot(sample_size,totals,labels,main="Acquisition of the sample",
 total.col="yellow",inc.col=2:5,bg.col="#eeeebb",direction="e")
#==========================================================================
sample_size<-c(200,+72,272,+94,366,+45,411)
totals2<-c(TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE)
labels2<-c("Beginning Level","Humor","","Water","","Candy","Final Level")
#==========================================================================
staircase.plot(sample_size,totals2,labels2,main="Energy Level",
 total.col="gray",inc.col=2:5,bg.col="#eeeebb",direction="s")


Pyramid plot (comparing nested groups) library(plotrix) See examples for best understanding:
EXAMPLES
x11(15,8)
par(mfrow=c(1,2))
xy.pop<-c(3.2,3.5,3.6,3.6,3.5,3.5,3.9,3.7,3.9,3.5,3.2,2.8,2.2,1.8,
 1.5,1.3,0.7,0.4)
xx.pop<-c(3.2,3.4,3.5,3.5,3.5,3.7,4,3.8,3.9,3.6,3.2,2.5,2,1.7,1.5,
 1.3,1,0.8)
agelabels<-c("0-4","5-9","10-14","15-19","20-24","25-29","30-34",
 "35-39","40-44","45-49","50-54","55-59","60-64","65-69","70-74",
 "75-79","80-84","85+")
mcol<-color.gradient(c(0,0,0.5,1),c(0,0,0.5,1),c(1,1,0.5,1),18)
fcol<-color.gradient(c(1,1,0.5,1),c(0.5,0.5,0.5,1),c(0.5,0.5,0.5,1),18)
#==========================================================================
par(mar=pyramid.plot(xy.pop,xx.pop,labels=agelabels,
 main="Australian population pyramid 2002",lxcol=mcol,rxcol=fcol,
 gap=0.5,show.values=TRUE))
#==========================================================================
# three column matrices
avtemp<-c(seq(11,2,by=-1),rep(2:6,each=2),seq(11,2,by=-1))
malecook<-matrix(avtemp+sample(-2:2,30,TRUE),ncol=3)
femalecook<-matrix(avtemp+sample(-2:2,30,TRUE),ncol=3)
# group by age
agegrps<-c("0-10","11-20","21-30","31-40","41-50","51-60",
 "61-70","71-80","81-90","91+")
#==========================================================================
oldmar<-pyramid.plot(malecook,femalecook,labels=agegrps,
 unit="Bowls per month",lxcol=c("#ff0000","#eeee88","#0000ff"),
 rxcol=c("#ff0000","#eeee88","#0000ff"),laxlab=c(0,10,20,30),
 raxlab=c(0,10,20,30),top.labels=c("Males","Age","Females"),gap=3)
# put a box around it
box()
# give it a title
mtext("Porridge temperature by age and sex of cook",3,2,cex=1.5)
# stick in a legend
legend(par("usr")[1],11,c("Too hot","Just right","Too cold"),
 fill=c("#ff0000","#eeee88","#0000ff"))


Engelmann-Hecker-Plot

Compare spread of grouped data

library(plotrix)

ehplot(data, groups, intervals=50, offset=0.1, log=FALSE, median=TRUE, box=FALSE, boxborder="grey50", xlab="groups", ylab="values", col="black", ...) Arguments:

Example
data(iris); library(plotrix)
ehplot(iris$Sepal.Length, iris$Species, intervals=20, cex=1.8, pch=20)
ehplot(iris$Sepal.Width, iris$Species, intervals=20, box=TRUE, median=FALSE)
ehplot(iris$Petal.Length, iris$Species, pch=17, col="red", log=TRUE)
ehplot(iris$Petal.Length, iris$Species, offset=0.06, pch=as.numeric(iris$Species))
# Groups don't have to be presorted:
rnd <- sample(150)
plen <- iris$Petal.Length[rnd]
pwid <- abs(rnorm(150, 1.2))
spec <- iris$Species[rnd]
ehplot(plen, spec, pch=19, cex=pwid, col=rainbow(3, alpha=0.6)[as.numeric(spec)])

Created using the panes() function

data(iris)
ehplot(iris$Sepal.Length, iris$Species, intervals=20, offset=0.1, cex=1.5, pch=20)
tab.title("ehplot 1", tab.col=1)
ehplot(iris$Sepal.Width, iris$Species, intervals=20, offset=0.1, box=TRUE, median=FALSE)
tab.title("ehplot 2", tab.col=4)
ehplot(iris$Petal.Length, iris$Species, pch=17, offset=0.1, col="red", log=TRUE)
tab.title("ehplot 3", tab.col=3)
ehplot(iris$Petal.Length, iris$Species, offset=0.1, pch=as.numeric(iris$Species))
tab.title("ehplot 4", tab.col=2)


Hexbin Plot (visualize closely clustered data)   see also sunflowerplot   high density data
plot(hexbin(x, y, xbins = 30, shape = 1,xbnds = range(x), ybnds = range(y),xlab = NULL, ylab = NULL))
EXAMPLE
library(plyr)  #contains a large data set
begend(baseball)  #look at beginning and end of data set
library(hexbin)
with(baseball, plot(hexbin(r, ab)))

Bump chart (looking at how ranks have changed from time 1 to time 2) library(plotrix)
bumpchart(y,top.labels=colnames(y),labels=rownames(y),rank=TRUE,mar=c(2,8,5,8),pch=19,col=par("fg"),lty=1,lwd=1)

Arguments:

EXAMPLE
#======================================================================
# percentage of those over 25 years having completed high school
# in 10 cities in the USA in 1990 and 2000
educattn<-matrix(c(90.4,90.3,75.7,78.9,66,71.8,70.5,70.4,68.4,67.9,
 67.2,76.1,68.1,74.7,68.5,72.4,64.3,71.2,73.1,77.8),ncol=2,byrow=TRUE)
rownames(educattn)<-c("Anchorage AK","Boston MA","Chicago IL",
 "Houston TX","Los Angeles CA","Louisville KY","New Orleans LA",
 "New York NY","Philadelphia PA","Washington DC")
colnames(educattn)<-c(1990,2000)
#......................................................................
bumpchart(educattn,main="Rank for high school completion by over 25s")
#======================================================================
# now show the raw percentages and add central ticks
#======================================================================
bumpchart(educattn,rank=FALSE,
 main="Percentage high school completion by over 25s",col=rainbow(10))
# margins have been reset, so use par(xpd=TRUE)
par(xpd=TRUE)
boxed.labels(1.5,seq(65,90,by=5),seq(65,90,by=5))
par(xpd=FALSE)


Heatmap with numbers


x <- "http://datasets.flowingdata.com/ppg2008.csv"
nba <- read.csv(x)
dst <- dist(nba[1:20, -1])
dst <- data.matrix(dst)
dim <- ncol(dst)
sdim <- seq_len(dim)
image(sdim, sdim, dst, axes = FALSE)
axis(1, sdim, nba[1:20, 1], cex.axis = 0.5)
axis(2, sdim, nba[1:20, 1], cex.axis = 0.5)
lapply(sdim, function(i) {
  lapply(sdim, function(j) {
    txt <- sprintf("%0.1f", dst[i, j])
    text(i, j, txt, cex = 0.5)
  })
})


Scatterplot Symbols
plot(Set$Attitude ~ Set$Grade, col="pink")
To change the plotted point symbols use the pch= argument. Note: Set$Attitude and Set$Grade are merely variable names attached to the data set (data frame).
On Windows, R supports Unicode symbols via negative pch values
plot(1, 1, pch = -0x2665L, cex = 10, xlab = "", ylab = "", col = "firebrick3")
points(.8, .8, pch = -0x2642L, cex = 10, col = "firebrick3")
points(1.2, 1.2, pch = -0x2640L, cex = 10, col = "firebrick3")

Plot groups by color
Argument to plot: see example below


EXAMPLE
#Plot data by cylinder groups w/ legend
with(mtcars, plot(drat, hp, col=c("blue","green","red")[as.numeric(as.factor(cyl))]))
legend(locator(1), c("4 cyl","6 cyl","8 cyl"), fill=c("blue","green","red"))
#I had to say as.factor for cylinder first because it was actually a numeric variable

Identify Plot Points (Plot point labels and identification)   slocate sidentify
Locate coordinates of a specific point: locator(1)
Note that locator(n) can be used to locate n specific points and compile them into a list. This could be useful for locating extreme or odd values that lie outside the overall scatter or group scatter, as in the code in the example box below.

Locate the name of a specific point: identify(x-vector, y-vector, vector of labels)
Label all points: text(x-vector, y-vector, vector of labels)
See example below for details
EXAMPLE
#Plot data by cylinder groups w/ legend
with(mtcars, plot(drat, hp, col=c("blue","green","red")[as.numeric(as.factor(cyl))]))
legend(locator(1), c("4 cyl","6 cyl","8 cyl"), fill=c("blue","green","red"))
locator(1)  #locate a specific point on the plot (x and y coordinate)

#Example of using locator to create a data frame of coordinates for extreme values
outlier <- data.frame(locator(4)[1:2])
outlier
#locate certain points by name (adj: -1=right justify, 1=left, .5=centered)
with(mtcars, identify(drat, hp, labels=c(rownames(mtcars)), adj=1))
#label the points on the plot
with(mtcars, text(drat, hp, labels=c(rownames(mtcars)), cex=.5, adj=c(0,-1)))


Interactive coloring of points (click and recolor)


x <- 1:5
plot(x, x, col=ifelse(x==3, "red", "black"), pch=19)
plot(x, x, col=ifelse(x==3, "red", "black"), pch=ifelse(x==3, 19, 2), cex=ifelse(x==3, 2, 1))
with(mtcars, plot(hp, disp, pch=19, col=c(ifelse(mpg>25, 'red', 'green'))))
#===========================================
n <- 15
x <- rnorm(n)
y <- rnorm(n)
# Plot the data
plot(x, y, pch = 19, cex = 2)
# This lets you click on the points you want to change
# the color of. Right click and select "stop" when
# you have clicked all the points you want
pnt <- identify(x, y, plot = FALSE)
# This colors those points red
points(x[pnt], y[pnt], col = "red", pch = 19, cex = 2)
points(x[pnt], y[pnt], col = "green", pch = 17, cex = 1)


Flip the x and y axis

library(lattice)

# First make some example data
df <- data.frame(name=rep(c("a", "b", "c"), each=5), value=rnorm(15))
# Then try plotting it in both 'orientations'
# ... as a dotplot
xyplot(value~name, data=df)
xyplot(name~value, data=df)
# ... or perhaps as a 'box-and-whisker' plot
bwplot(value~name, data=df)
bwplot(name~value, data=df)


Blending Plots in R sblending transparency

set.seed(42)
p1 <- hist(rnorm(500,4))  # centered at 4
p2 <- hist(rnorm(500,6))  # centered at 6
plot(p1, col=rgb(0,0,1,1/4), xlim=c(0,10))  # first histogram
plot(p2, col=rgb(1,0,0,1/4), xlim=c(0,10), add=TRUE)  # second hist
#or
a <- rnorm(1000, 3, 1)
b <- rnorm(1000, 6, 1)
hist(a, xlim=c(0,10), col="red")
hist(b, add=TRUE, col=rgb(0, 1, 0, 0.5))

library(ggplot2)
path <- "http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/SAheart.data"
saheart <- read.table(path, sep=",", head=TRUE, row.names=1)
fmla <- "chd ~ sbp + tobacco + ldl + adiposity + famhist + typea + obesity"
model <- glm(fmla, data=saheart, family=binomial(link="logit"), na.action=na.exclude)
dframe <- data.frame(chd=as.factor(saheart$chd),
 prediction=predict(model, type="response"))
ggplot(dframe, aes(x=prediction, colour=chd)) + geom_density()
ggplot(dframe, aes(x=prediction, fill=chd)) +
 geom_histogram(position="identity", binwidth=0.05, alpha=0.5)



Add text to the graph margins

See Example Text-Rect-LineSeg (search this tag)

mtext(text, side = 3, line = 0, outer = FALSE, at = NA, adj = NA, padj = NA, cex = NA, col = NA, font = NA, ...)

See also: Eliminate margins for clickText()

Arguments
text:  a character or expression vector specifying the text to be written. Other objects are coerced by as.graphicsAnnot.
side:  on which side of the plot (1=bottom, 2=left, 3=top, 4=right).
line:  on which MARgin line, starting at 0 counting outwards.
outer: use outer margins if available.
at:    give location of each string in user coordinates. If the component of at corresponding to a particular text item is not a finite value (the default), the location will be determined by adj.
adj:   adjustment for each string in reading direction. For strings parallel to the axes, adj = 0 means left or bottom alignment, and adj = 1 means right or top alignment. If adj is not a finite value (the default), the value of par("las") determines the adjustment. For strings plotted parallel to the axis the default is to centre the string.
padj:  adjustment for each string perpendicular to the reading direction (which is controlled by adj). For strings parallel to the axes, padj = 0 means right or top alignment, and padj = 1 means left or bottom alignment. If padj is not a finite value (the default), the value of par("las") determines the adjustment. For strings plotted perpendicular to the axis the default is to centre the string.
cex:   character expansion factor. NULL and NA are equivalent to 1.0. This is an absolute measure, not scaled by par("cex") or by setting par("mfrow") or par("mfcol"). Can be a vector.
col:   color to use. Can be a vector. NA values (the default) mean use par("col").
font:  font for text. Can be a vector. NA values (the default) mean use par("font").
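A small sketch of mtext() using the arguments above (all labels and line values are illustrative):

```r
# Write text into each of the four margins of a simple plot
plot(1:10, 1:10, main = "mtext() demo")
mtext("bottom margin, line 0", side = 1, line = 0, adj = 1, cex = 0.8)
mtext("left margin",  side = 2, line = 2, col = "blue")
mtext("top margin",   side = 3, line = 1, font = 2)   # font = 2 is bold
mtext("right margin", side = 4, line = 0, col = "red")
```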

Plot Curved Text (arc text)


Arguments:

library(plotrix)
EXAMPLES
plot(0, xlim=c(1,5), ylim=c(1,5), main="Test of arctext", xlab="", ylab="", type="n")
arctext("bendy like spaghetti", center=c(3,3), col="blue")
arctext("bendy like spaghetti", center=c(3,3), radius=1.5, start=pi, cex=2)
arctext("bendy like spaghetti", center=c(3,3), radius=0.5, start=pi/2, stretch=1.2)

arctext(x,center=c(0,0),radius=1,start=NA,middle=pi/2,stretch=1,cex=1,...)


Add a Table to a Graph   library(plotrix)   [see also scripts for a click function]   add table; table plot
addtable2plot(x, y=NULL, table, lwd=par("lwd"), bty="n", bg=par("bg"), cex=1, xjust=0, yjust=1, box.col=par("fg"), text.col=par("fg"), display.colnames=TRUE, display.rownames=FALSE, hlines=FALSE, vlines=FALSE, title=NULL)
Arguments:
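A sketch of addtable2plot() in use, assuming plotrix is installed; the table of means and the placement coordinates are illustrative:

```r
# Overlay a small table of group means on a scatterplot
library(plotrix)
with(mtcars, plot(wt, mpg, pch = 19))
tab <- aggregate(mpg ~ cyl, data = mtcars, FUN = function(x) round(mean(x), 1))
# x/y give the table's position in user coordinates
addtable2plot(4.2, 28, tab, display.rownames = FALSE, title = "Mean mpg by cyl")
```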

Plot Lines (vertical, horizontal, or sloped)   [see line types]
abline(various arguments)
EXAMPLE
with(mtcars, plot(drat, hp))
abline(h=204)  #plot horizontal
abline(v=4)  #plot vertical
abline(a=210, b=20)  #plot sloped (a=y intercept; b=slope)

Plot Lowess Line Example:


with(mtcars, plot(mpg, disp, pch=cyl, col=cyl+2))
with(mtcars, lines(lowess(cbind(mpg, disp)), lwd=2, col="blue"))

Plot a circle

library(plotrix) [see also scripts for a shapeClick function]

draw.circle(x,y,radius,nv=100,border=NULL,col=NA,lty=1,lwd=1) Arguments:


Plot a circle inside a square library(plotrix); library(grid) circle square


require(plotrix)
require(grid)
plot(c(-1, 1), c(-1, 1), type = "n", asp = 1)
rect(-.5, -.5, .5, .5)
draw.circle(0, 0, .5)  #note asp must be specified


Add Math symbols and Expressions to Plot expression() #wrapped in title, text, mtext etc.
EXAMPLE
frame()
title(expression("graph of the function f"(x) == sqrt(1+x^2)))
text(locator(1), expression(sum(x)/sqrt(n*S^2)))
text(locator(1), expression(hat(beta)==-.567))
text(locator(1), expression(hat(Omega)==infinity*frac(x, y)))
mtext(expression(Area == pi*r^2), side=2, line=-12)
text(locator(1), expression(bar(x) == sum(frac(x[i], n), i==1, n)))
mtext(expression(2.3 %+-% 4.5*pi), side=1, line=-5, adj=.7)
mtext(expression(bar(xy)!=sum(x[i], i==1, n)), side=1, line=-5, adj=0)
textClick(expression(sum(sum((X[ij]-bar(X))^2))))
textClick(expression(sum(x[i], i==1, n)), "green", 3)


List of Math Symbols to Add


Syntax              Meaning
x + y               x plus y
x - y               x minus y
x*y                 juxtapose x and y
x/y                 x forwardslash y
x %+-% y            x plus or minus y
x %/% y             x divided by y
x %*% y             x times y
x %.% y             x cdot y
x[i]                x subscript i
x^2                 x superscript 2
paste(x, y, z)      juxtapose x, y, and z
sqrt(x)             square root of x
sqrt(x, y)          yth root of x
x == y              x equals y
x != y              x is not equal to y
x < y               x is less than y
x <= y              x is less than or equal to y
x > y               x is greater than y
x >= y              x is greater than or equal to y
x %~~% y            x is approximately equal to y
x %=~% y            x and y are congruent
x %==% y            x is defined as y
x %prop% y          x is proportional to y
plain(x)            draw x in normal font
bold(x)             draw x in bold font
italic(x)           draw x in italic font
bolditalic(x)       draw x in bolditalic font
symbol(x)           draw x in symbol font
list(x, y, z)       comma-separated list

Syntax              Meaning
...                 ellipsis (height varies)
cdots               ellipsis (vertically centred)
ldots               ellipsis (at baseline)
x %subset% y        x is a proper subset of y
x %subseteq% y      x is a subset of y
x %notsubset% y     x is not a subset of y
x %supset% y        x is a proper superset of y
x %supseteq% y      x is a superset of y
x %in% y            x is an element of y
x %notin% y         x is not an element of y
hat(x)              x with a circumflex
tilde(x)            x with a tilde
dot(x)              x with a dot
ring(x)             x with a ring
bar(xy)             xy with bar
widehat(xy)         xy with a wide circumflex
widetilde(xy)       xy with a wide tilde
x %<->% y           x double-arrow y
x %->% y            x right-arrow y
x %<-% y            x left-arrow y
x %up% y            x up-arrow y
x %down% y          x down-arrow y
x %<=>% y           x is equivalent to y
x %=>% y            x implies y
x %<=% y            y implies x
x %dblup% y         x double-up-arrow y
x %dbldown% y       x double-down-arrow y
alpha omega         Greek symbols
Alpha Omega         uppercase Greek symbols


Syntax                         Meaning
theta1, phi1, sigma1, omega1   cursive Greek symbols
Upsilon1                       capital upsilon with hook
aleph                          first letter of Hebrew alphabet
infinity                       infinity symbol
partialdiff                    partial differential symbol
nabla                          nabla, gradient symbol
32*degree                      32 degrees
60*minute                      60 minutes of angle
30*second                      30 seconds of angle
displaystyle(x)                draw x in normal size (extra spacing)
textstyle(x)                   draw x in normal size
scriptstyle(x)                 draw x in small size
scriptscriptstyle(x)           draw x in very small size
underline(x)                   draw x underlined
x ~~ y                         put extra space between x and y
x + phantom(0) + y             leave gap for "0", but don't draw it
x + over(1, phantom(0))        leave vertical gap for "0" (don't draw)
frac(x, y)                     x over y
over(x, y)                     x over y
atop(x, y)                     x over y (no horizontal bar)
sum(x[i], i==1, n)             sum x[i] for i equals 1 to n
prod(plain(P)(X==x), x)        product of P(X=x) for all values of x
integral(f(x)*dx, a, b)        definite integral of f(x) wrt x
union(A[i], i==1, n)           union of A[i] for i equals 1 to n
intersect(A[i], i==1, n)       intersection of A[i]
lim(f(x), x %->% 0)            limit of f(x) as x tends to 0
min(g(x), x > 0)               minimum of g(x) for x greater than 0
inf(S)                         infimum of S
sup(S)                         supremum of S
x^y + z                        normal operator precedence
x^(y + z)                      visible grouping of operands
x^{y + z}                      invisible grouping of operands
group("(",list(a, b),"]")      specify left and right delimiters
bgroup("(",atop(x,y),")")      use scalable delimiters
group(lceil, x, rceil)         special delimiters


Expressions in Titles (method 1)


a <- 5
b <- 1
plot(1:10, main=bquote(p==.(a) * "," ~ q==.(b)))

Expressions in Titles (method 2)


a <- 5
b <- 1
plot(1:10, main = substitute(paste(p == a, ", ", q == b), list(a = a, b = b)))


Control Margins & Eliminate Margins
par(mar=c(0,0,0,0))   #standard
par(mar = rep(0, 4))  #no margin

(use locator with text to put text outside plot) see also clickText() below

#Margins Control Examples
#========================================================
#Standard Margins
x11()
frame()
par(oma = c(0,0,0,0))
grid()
#========================================================
#Attempt to add text to outer margins using locator (Wrong way)
x11()
frame()
par(oma = c(0,0,0,0))
with(mtcars, plot(mpg~wt))
text(locator(1), expression(beta==3))
#The mistake: tried to add text w/o changing par(mar = rep(0, 4))
text(locator(1), expression(beta==3))
#========================================================
#Add text using locator to outer margins (Correct way)
x11()
frame()
par(oma = c(0,0,0,0))
with(mtcars, plot(mpg~wt))
#Correct way: 1) plot; 2) change margins; 3) add text
par(mar = rep(0, 4))
text(locator(1), expression(beta==3))
#========================================================
#Plot with no margins
x11()
frame()
par(mar = rep(0, 4))
with(mtcars, plot(mpg~wt))
text(locator(1), expression(beta==3))

Key point: change the margins first, then add the text.

Function for adding text or an expression anywhere with locator(): clickText()


source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Graphics/Click text")

clickText(expression, col, font size)
The first argument must either be quoted text or an expression().
Text outside the margins: clickText() utilizes par(xpd = NA).


Normal curve with upper/lower shaded (uses the polygon function) Example
xv <- seq(-3, 3, .01)
yv <- dnorm(xv)
windows(h = 5, w = 11)
par(mfrow = c(1, 2))
plot(xv, yv, type = "l", main = "2 Standard Deviation")
polygon(c(xv[xv <= -2], -2), c(yv[xv <= -2], yv[xv == -3]), col = "blue", border = "green")
polygon(c(xv[xv >= 2], 2), c(yv[xv >= 2], yv[xv == 3]), col = "blue", border = "green")
plot(xv, yv, type = "l", main = "1 Standard Deviation")
polygon(c(xv[xv <= -1], -1), c(yv[xv <= -1], yv[xv == -3]), col = "red", border = "orange")
polygon(c(xv[xv >= 1], 1), c(yv[xv >= 1], yv[xv == 3]), col = "red", border = "orange")

Use the mouse to add line segments, rectangles, arrows, and polygons
source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Graphics/Click Shapes.txt")

shapeClick(shape = "arrow", corners = NULL, col = NULL, border = NULL, lty = par("lty"), lwd = par("lwd"), code = 2)
Example code and use:

source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Graphics/Click Shapes.txt")
with(mtcars, plot(mpg, disp))
shapeClick("seg", col = "red")
shapeClick("box", col = "yellow", border = "red", lty = 1)
shapeClick("arrow", col = "orange", lwd = 4, lty = 55)
shapeClick("arrow", code = 3, col = "blue", lwd = 2)
shapeClick("poly", 3, col = "yellow", border = "red", lwd = 2)
shapeClick("poly", 5, border = "green", lwd = 2)

Add Rug to Graph (shows the actual data points): rug(variable)
# like adding a line, this is done after the graph is plotted

x11() with(mtcars,plot(mpg~disp)) rug(mtcars$disp);rug(mtcars$mpg,side=2)

Add line segments to the graph

See Example Text-Rect-LineSeg Below

segments(x0, y0, x1, y1, col = par("fg"), lty = par("lty"), lwd = par("lwd"), ...)
# x0, y0, x1, y1 are the coordinates of the start and end points

Add rectangles to the graph (see Example Text-Rect-LineSeg)
rect(xleft, ybottom, xright, ytop, angle = 45, col = NULL, border = NULL, lty = NULL, lwd = par("lwd"))
#=================================================================================================
# VARIOUS TEXT ARGUMENTS
#=================================================================================================
windows(h = 6.5, w = 10); par(mfrow = c(2, 3))
plot(1:10, (-4:5)^2, main = "Parabola Points", xlab = "xlab")
text(3, 23, "pos 1", pos = 1); text(3, 23, "pos 2", pos = 2)
text(3, 23, "pos 3", pos = 3); text(3, 23, "pos 4", pos = 4)
plot(1:10, (-4:5)^2, main = "Parabola Points", xlab = "xlab")
text(locator(1), "yippee", pos = 1)
plot(1:10, (-4:5)^2, main = "Parabola Points", xlab = "xlab")
mtext("line -2", line = -2); mtext("line 2", line = 2)
mtext("line 3", line = 3); mtext("line -6", line = -6)
plot(1:10, (-4:5)^2, main = "Parabola Points", xlab = "xlab")
mtext("adj 0", line = -2, adj = 0); mtext("adj .5", line = -2, adj = .5)
mtext("adj 1", line = -2, adj = 1)
plot(1:10, (-4:5)^2, main = "Parabola Points", xlab = "xlab")
mtext("side 1", side = 1); mtext("side 2", side = 2)
mtext("side 3", side = 3); mtext("side 4", side = 4)
plot(1:10, (-4:5)^2, main = "Parabola Points", xlab = "xlab")
mtext("side4;adj1", side = 4, adj = 1, col = "blue")
mtext("side2;adj0", side = 2, adj = .5, col = "red")
mtext("side4;adj.5", side = 4, adj = 0, col = "green")
mtext("side4;adj.5;padj1", side = 4, adj = .5, padj = 1, col = "purple")
mtext("side4;adj.5;padj-2", side = 4, adj = .5, padj = -2, col = "orange")
#=================================================================================================
# VARIOUS RECTANGLE USES/ARGUMENTS
#=================================================================================================
windows(h = 6, w = 6); par(mfrow = c(1, 1))
plot(1:10, (-4:5)^2, main = "Parabola Points", xlab = "xlab")
rect(6, 10, 8, 12, angle = 45, col = NULL, border = "red", lty = 1, lwd = 2)
text(6.16, 11, "HELLO!", pos = 4)
rect(4, 15, 8, 20, angle = 45, col = NULL, border = "blue", lty = 1, lwd = 4)
text(locator(1), "HELLO!", pos = 4, cex = 2)
rect(8, 0, 10, 3, angle = 45, col = "black", border = "blue", lty = 1, lwd = 4)
text(locator(1), "HELLO!", pos = 4, cex = .8, col = "white")
#=================================================================================================
# VARIOUS LINE SEGMENT USES/ARGUMENTS
#=================================================================================================
segments(4, 0, 10, 10, col = "orange", lty = 1, lwd = 1)
segments(2, 25, 8, 25, col = "blue", lty = 2, lwd = 1)
segments(2, 0, 2, 25, col = "yellow", lty = 1, lwd = 3)
#=================================================================================================
# USING TEXT TO CREATE A LABEL (In action)
#=================================================================================================
windows(h = 6, w = 6); par(mfrow = c(1, 1))

Example Text-Rect-LineSeg

x <- mtcars[order(mtcars$mpg), ]   # sort by mpg
x$cyl <- factor(x$cyl)             # it must be a factor
x$color[x$cyl == 4] <- "red"
x$color[x$cyl == 6] <- "blue"
x$color[x$cyl == 8] <- "darkgreen"
dotchart(x$mpg, labels = row.names(x), cex = .7, groups = x$cyl,
         main = "Gas Mileage for Car Models\ngrouped by cylinder",
         xlab = "Miles Per Gallon", gcolor = "black", color = x$color)
mtext("Cars Grouped by Cylinder", side = 2, line = 2, cex = .7)

Add Arrows to Graph arrows(x0,y0,x1,y1)
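A quick sketch of arrows() in action; the coordinates are arbitrary, chosen only for illustration:

```r
# draw arrows on a simple scatterplot
plot(1:10, (-4:5)^2, main = "Parabola Points")
arrows(2, 5, 8, 20, col = "red", lwd = 2)      # code = 2 (default): head at (x1, y1)
arrows(2, 25, 8, 25, code = 3, length = 0.1)   # code = 3: heads at both ends
```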


Add a Legend to a Graph: legend()

EXAMPLE
legend(locator(1),c("Grazed","Ungrazed"),fill=c("blue","darkgreen"))

legend(x, y = NULL, legend, fill = NULL, col = par("col"), border="black", lty, lwd, pch, angle = 45, density = NULL, bty = "o", bg = par("bg"), box.lwd = par("lwd"), box.lty = par("lty"), box.col = par("fg"), pt.bg = NA, cex = 1, pt.cex = cex, pt.lwd = lwd, xjust = 0, yjust = 1, x.intersp = 1, y.intersp = 1, adj = c(0, 0.5), text.width = NULL, text.col = par("col"), merge = do.lines && has.pch, trace = FALSE, plot = TRUE, ncol = 1, horiz = FALSE, title = NULL, inset = 0, xpd, title.col = text.col, title.adj = 0.5, seg.len = 2)
Description: This function can be used to add legends to plots. Note that a call to the function locator(1) can be used in place of the x and y arguments.

Arguments:
x, y: the x and y co-ordinates used to position the legend. They can be specified by keyword or in any way accepted by xy.coords; see Details.
fill: if specified, boxes filled with (or shaded in) the specified colors appear beside the legend text.
col: the color of points or lines appearing in the legend.
border: the border color for the boxes (used only if fill is specified).
lty, lwd: the line types and widths for lines appearing in the legend. One of these two _must_ be specified for line drawing.
pch: the plotting symbols appearing in the legend, either as a vector of 1-character strings or one (multi-character) string. _Must_ be specified for symbol drawing.
angle: angle of shading lines.
density: the density of shading lines, if numeric and positive. If NULL, negative, or NA, color filling is assumed.
bty: the type of box to be drawn around the legend. The allowed values are "o" (the default) and "n".
bg: the background color for the legend box. (Only used if bty != "n".)
box.lty, box.lwd, box.col: the line type, width, and color for the legend box (if bty = "o").
pt.bg: the background color for the points, corresponding to its argument bg.
cex: character expansion factor *relative* to current par("cex"). Used for text, and provides the default for pt.cex and title.cex.
pt.cex: expansion factor(s) for the points.
pt.lwd: line width for the points; defaults to the one for lines, or if that is not set, to par("lwd").
xjust: how the legend is justified relative to the legend x location. 0 means left-justified, 0.5 centered, 1 right-justified.
yjust: the same as xjust for the legend y location.
x.intersp: character interspacing factor for horizontal (x) spacing.
y.intersp: the same for vertical (y) line distances.
adj: numeric of length 1 or 2; the string adjustment for legend text. Useful for y-adjustment when labels are plotmath expressions.
text.width: the width of the legend text in x ("user") coordinates. (Should be positive even for a reversed x axis.) Defaults to the proper value computed by strwidth(legend).
text.col: the color used for the legend text.
merge: logical; if TRUE, merge points and lines but not filled boxes. Defaults to TRUE if there are points and lines.
trace: logical; if TRUE, shows how legend does all its magical computations.
plot: logical. If FALSE, nothing is plotted but the sizes are returned.
ncol: the number of columns in which to set the legend items (default is 1, a vertical legend).
horiz: logical; if TRUE, set the legend horizontally rather than vertically (specifying horiz overrides the ncol specification).
title: a character string or length-one expression giving a title to be placed at the top of the legend. Other objects will be coerced by as.graphicsAnnot.
inset: inset distance(s) from the margins as a fraction of the plot region when the legend is placed by keyword.
xpd: if supplied, a value of the graphical parameter xpd to be used while the legend is being drawn.
title.col: color for the title.
title.adj: horizontal adjustment for the title; see the help for par("adj").
seg.len: the length of lines drawn to illustrate lty and/or lwd (in units of character widths).


Draw a cylinder

library(plotrix)

cylindrect(xleft,ybottom,xright,ytop,col,border=NA,gradient="x",nslices=50) Arguments:

example(cylindrect)   # included in the shapeClick function in scripts

Insert a break in a scale (broken scale)
library(plotrix)

axis.break(axis=1,breakpos=NULL,pos=NA,bgcol="white",breakcol="black",style="slash",brw=0.02)

Arguments:

EXAMPLE
#===========================================
x11(12, 8)
par(mfrow = c(1, 2))
#===========================================
barplot(tN, col = heat.colors(12), log = "y")
axis.break(axis = 2, breakpos = 4, style = "zigzag")
axis.break(axis = 2, breakpos = 9, style = "zigzag")
#===========================================
plot(mpg ~ cyl, mtcars)
axis.break(breakpos = 4.5, axis = 1)


Plot with a zoomed in plot side by side

library(plotrix)

zoomInPlot(x,y=NULL,xlim=NULL,ylim=NULL,rxlim=xlim, rylim=ylim,xend=NA,zoomtitle=NULL,titlepos=NA,...)

Arguments:
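A minimal sketch, assuming plotrix is installed; rxlim and rylim give the region magnified in the right-hand panel:

```r
library(plotrix)
# full scatterplot on the left, zoomed region on the right
with(mtcars, zoomInPlot(wt, mpg,
                        rxlim = c(3, 4), rylim = c(15, 25),
                        zoomtitle = "Zoomed region"))
```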

Scatterplot w/ histogram, correlation, density, ellipse

library(psych)

scatter.hist (x, y = NULL, smooth = TRUE, ab = FALSE, correl = TRUE, density = TRUE, ellipse = TRUE, digits = 2, cex.cor = 1, title = "Scatter plot + histograms", xlab = NULL, ylab = NULL)
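A minimal sketch on the iris data, assuming the psych package is installed:

```r
library(psych)
# scatterplot with marginal histograms, correlation, density, and ellipse
with(iris, scatter.hist(Sepal.Length, Sepal.Width,
                        xlab = "Sepal Length", ylab = "Sepal Width",
                        title = "Sepal Length vs. Sepal Width"))
```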


Design Plots (compare mean differences, sd, var, or medians) [effects] plot.design(y~x1*x2*xn,fun="mean") fun arguments: mean, median, sd, var
EXAMPLE
dat <- read.table("HW19.csv", header = TRUE, sep = ",", na.strings = "999")
dat$Attitude <- as.factor(dat$Attitude)
mod <- lm(Science.Comprehension ~ Gender * Attitude * Grade, data = dat)
anova(mod)
windows(h = 6, w = 10)
par(mfrow = c(1, 2))
with(dat, plot.design(Science.Comprehension ~ Gender * Attitude * Grade,
                      main = "Mean Differences"))
with(dat, plot.design(Science.Comprehension ~ Gender * Attitude * Grade,
                      fun = "sd", main = "SD Comparisons"))

bwplot (compares means and spread in a multi-box-plot format): bwplot(formula, data)
Formula is in the form y ~ x | g1 * g2 * ... (or equivalently, y ~ x | g1 + g2 + ...), indicating that plots of y (on the y-axis) versus x (on the x-axis) should be produced conditional on the variables g1, g2, .... Here x and y are the primary variables, and g1, g2, ... are the conditioning variables. The conditioning variables may be omitted to give a formula of the form y ~ x, in which case the plot will consist of a single panel with the full dataset. The formula can also involve expressions, e.g., sqrt(), log(), etc.
EXAMPLE
dat <- read.table("HW19.csv", header = TRUE, sep = ",", na.strings = "999")
dat$Attitude <- as.factor(dat$Attitude)
mod <- lm(Science.Comprehension ~ Gender * Attitude * Grade, data = dat)
anova(mod)
library(lattice)
trellis.par.set(col.whitebg())
bwplot(Science.Comprehension ~ Gender | Attitude * Grade, dat)


Box plot with confidence intervals
library(psych)
boxplot(attitude, notch = FALSE, main = "Boxplot with error bars")
error.bars(attitude, add = TRUE)

boxplot(attitude,notch=T,main="Notched boxplot with error bars") error.bars(attitude,add=TRUE)

Spaghetti Plot for Repeated Measures Data library(lattice) xyplot(y ~ x, groups =, type = "b", data=)
source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Data Sets.txt")
library(reshape)
rep.mes2 <- rep.mes
Sex <- gl(2, 25, length = 50, labels = c("Male", "Female"))
rep.mes2 <- data.frame(rep.mes2[1:2], Sex, rep.mes2[3:5])
long.rep.mes <- melt(rep.mes2, id = 1:3)[order(melt(rep.mes)$Sub), ]
rownames(long.rep.mes) <- 1:150
rep.mes2; long.rep.mes
library(lattice)
xyplot(value[1:18] ~ variable, groups = Sub, type = "b", data = long.rep.mes)
xyplot(value ~ variable, groups = Group, type = "b", data = long.rep.mes)
xyplot(value ~ variable, groups = Sex, type = "b", data = long.rep.mes)

Coplots: coplot(formula, data, ...)
EXAMPLE coplot(Sepal.Width~ Petal.Length|Petal.Width,data=iris,panel=panel.smooth)


Change Scale Text Direction: las = 0, 1, 2, or 3. NOTE: this is an argument to a plot call (or to par()).
EXAMPLE
par(mfrow = c(2, 2))
with(iris, plot(Sepal.Length, Sepal.Width, pch = as.numeric(Species), col = as.numeric(Species)))
with(iris, plot(Sepal.Length, Sepal.Width, pch = as.numeric(Species), las = 1, col = as.numeric(Species)))
with(iris, plot(Sepal.Length, Sepal.Width, pch = as.numeric(Species), las = 2, col = as.numeric(Species)))
with(iris, plot(Sepal.Length, Sepal.Width, pch = as.numeric(Species), las = 3, col = as.numeric(Species)))


Overplotting
# Generate some data
library(MASS)
set.seed(101)
n <- 50000
X <- mvrnorm(n, mu = c(.5, 2.5), Sigma = matrix(c(1, .6, .6, 1), ncol = 2))
# A color palette from blue to yellow to red
library(RColorBrewer)
k <- 11
my.cols <- rev(brewer.pal(k, "RdYlBu"))
## Compute 2D kernel density; see MASS book, pp. 130-131
z <- kde2d(X[, 1], X[, 2], n = 50)
# Make the base plot
plot(X, xlab = "X label", ylab = "Y label", pch = 19, cex = .4)
# Draw the colored contour lines
contour(z, drawlabels = FALSE, nlevels = k, col = my.cols, add = TRUE, lwd = 2)
# Make points smaller: use a single pixel as the plotting character
plot(X, pch = ".")
# Hexbinning
library(hexbin)
plot(hexbin(X[, 1], X[, 2]))
# Make points semi-transparent
library(ggplot2)
qplot(X[, 1], X[, 2], alpha = I(.1))
# The smoothScatter function (graphics package)
smoothScatter(X)


List of Graph Functions in Base


List of Graph Functions in Lattice


Graphics Arguments (Parameters)


Descriptive Statistics
Find standard deviation, variance, mean, median, range, & standard error.
From base R / the stats package: sd(), var(), mean(), median(), max(), min(), summary()
From library(plotrix): std.error(x, na.rm = TRUE)   # example: std.error(mtcars$mpg)

Descriptives
library(psych)
describe(x, na.rm = TRUE, interp = FALSE, skew = TRUE, ranges = TRUE, trim = .1)
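A base-R sketch of these functions on mtcars; the standard error is also computed by hand, since plotrix's std.error() is just sd/sqrt(n):

```r
x <- mtcars$mpg
c(sd = sd(x), var = var(x), mean = mean(x), median = median(x))
range(x)                    # min and max together
summary(x)                  # min, quartiles, mean, max
sd(x) / sqrt(length(x))     # standard error, what plotrix::std.error() returns
```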

Descriptives by Group (Note: pairwise deletion; this will include as much data as possible)
library(psych)
describe.by(dataset, variable1)                    # descriptives by variable 1
describe.by(dataset, variable2)                    # descriptives by variable 2
describe.by(dataset, list(variable1, variable2))   # all the interactions

library(psych)
# example:
g4 <- read.table("g4.csv", header = TRUE, sep = ",", na.strings = "999")
describe.by(g4, g4$gender)                  # descriptives by variable 1
describe.by(g4, g4$race)                    # descriptives by variable 2
describe.by(g4, list(g4$gender, g4$race))   # all the interactions

NOTE: I created functions to automate this for the .Regression Bundle script. desc2v() for 2 variables; desc3v() for 3 variables use rfun() to view the list of functions and arguments in the regression bundle
Use data.frame(cv[[A]][[B]])[c(desired variable rows), c(2, 3, 4)] to extract certain groups, columns & rows (columns 2, 3, 4 give n, mean & sd).
[[A]]:
"DESCRIPTIVES FOR VARIABLE 1" = [[1]]
"DESCRIPTIVES FOR VARIABLE 2" = [[2]]
"DESCRIPTIVES FOR VARIABLE 3" = [[3]]
"DESCRIPTIVES FOR INTERACTION OF VARIABLE 1 & 2" = [[4]]
"DESCRIPTIVES FOR INTERACTION OF VARIABLE 2 & 3" = [[5]]
"DESCRIPTIVES FOR INTERACTION OF VARIABLE 1 & 3" = [[6]]
"DESCRIPTIVES FOR INTERACTION OF VARIABLE 1, 2, & 3" = [[7]]
[[B]] is the level of the group or interaction, i.e.:
an [[A = 1]] with 2 groups = [[B n of 2]]
an [[A = 1]] with 3 groups = [[B n of 3]]
an [[A = 4]] with 2 and 3 groups = [[B n of 8]]
an [[A = 7]] with 2, 2, and 3 groups = [[B n of 19]]


Descriptives by Group (this relies on specific variables of a data set; more manageable info)
library(doBy) summaryBy()

examples:
#EXAMPLE of summaryBy()
g4 <- read.table("g4.csv", header = TRUE, sep = ",", na.strings = "999")
library(doBy)
summaryBy(mathscore + effort + initiative + valueing ~ race, data = g4,
          FUN = function(x) { c(n = length(x), mean = mean(x), sd = sd(x)) })
summaryBy(mathscore + effort + initiative + valueing ~ gender, data = g4,
          FUN = function(x) { c(n = length(x), mean = mean(x), sd = sd(x)) })
summaryBy(mathscore + effort + initiative + valueing ~ gender + race, data = g4,
          FUN = function(x) { c(n = length(x), mean = mean(x), sd = sd(x)) })

EXAMPLE WITH WRITE TO EXCEL
e30 <- read.table("e30.csv", header = TRUE, sep = ",", na.strings = "NA")
attach(e30)
DFD <- e30[, 5:11]
percent.disabled <- as.numeric(N.stud.disable) / as.numeric(class.enroll)
DFSD <- data.frame(DFD, percent.disabled)
DFSD <- na.omit(DFSD)
#_________________________________________________________________________
#DESCRIPTIVES ON AIDES
ZZ <- summaryBy(cl.behav.fall + cl.behav.spr + perc.min + percent.disabled ~ aide,
                data = DFSD, FUN = function(x) { c(n = length(x), mean = mean(x), sd = sd(x)) })
#_________________________________________________________________________
#DESCRIPTIVES ON CLASS TYPE
YY <- summaryBy(cl.behav.fall + cl.behav.spr + perc.min + percent.disabled ~ cl.type,
                data = DFSD, FUN = function(x) { c(n = length(x), mean = mean(x), sd = sd(x)) })
#_________________________________________________________________________
#DESCRIPTIVES ON CLASS TYPE & AIDE
XX <- summaryBy(cl.behav.fall + cl.behav.spr + perc.min + percent.disabled ~ aide + cl.type,
                data = DFSD, FUN = function(x) { c(n = length(x), mean = mean(x), sd = sd(x)) })
#_________________________________________________________________________
#TRANSFORMED DESCRIPTIVES ON CLASS TYPE & AIDE
WW <- t(summaryBy(cl.behav.fall + cl.behav.spr + perc.min + percent.disabled ~ aide + cl.type,
                  data = DFSD, FUN = function(x) { c(n = length(x), mean = mean(x), sd = sd(x)) }))
DESCRIBE2 <- list(ZZ, YY, XX, WW)
DESCRIBE2
write.table(ZZ, file = "DESCRIBE2.csv", sep = ",", col.names = NA, qmethod = "double")
write.table(YY, file = "DESCRIBE3.csv", sep = ",", col.names = NA, qmethod = "double")
write.table(XX, file = "DESCRIBE4.csv", sep = ",", col.names = NA, qmethod = "double")
write.table(WW, file = "DESCRIBE5.csv", sep = ",", col.names = NA, qmethod = "double")

Descriptives by Group (another approach): by(dataset, factor, summary)


Example
mtcars2 <- mtcars
library(doBy)
mtcars2$cyl <- with(mtcars, recodeVar(cyl, src = c(4, 6, 8),
                                      tgt = c("four", "six", "eight"),
                                      default = NULL, keep.na = TRUE))
by(mtcars2, mtcars2$cyl, summary)


Using ftable and tapply to generate descriptives Example


dat <- read.table("HW19.csv", header = TRUE, sep = ",", na.strings = "999")
dat$Attitude <- as.factor(dat$Attitude)
mod <- lm(Science.Comprehension ~ Gender * Attitude * Grade, data = dat)
anova(mod)
#==================================================================================================
# USING FTABLE AND TAPPLY TO GENERATE TABLES OF MEAN, SD, N, VAR ETC. BY GROUP
#==================================================================================================
with(dat, ftable(tapply(Science.Comprehension, list(Grade, Gender, Attitude), mean)))
with(dat, ftable(tapply(Science.Comprehension, list(Grade, Gender, Attitude), sd)))
with(dat, ftable(tapply(Science.Comprehension, list(Grade, Gender, Attitude), length)))
#==================================================================================================
# USING FTABLE AND TAPPLY TO COMPILE ONE TABLE OF MEAN, SD, N, VAR ETC. BY GROUP
#==================================================================================================
DF <- with(dat, as.data.frame.table(tapply(Science.Comprehension,
                                           list(Grade, Gender, Attitude), mean)))
stndev <- as.vector(with(dat, ftable(tapply(Science.Comprehension,
                                            list(Grade, Gender, Attitude), sd))))
n <- as.vector(with(dat, ftable(tapply(Science.Comprehension,
                                       list(Grade, Gender, Attitude), length))))
DF <- data.frame(DF[, 1:3], n, DF[, 4], stndev)
colnames(DF) <- c("Grade", "Gender", "Attitude", "n", "mean", "sd")
#==================================================================================================
# TABLE OF N, MEANS & SD BY GROUP
#==================================================================================================
DF

Descriptives by Group (favored method). It is easiest to see how this piece of code works through an example:
#EXAMPLE
library(reshape)
dstats <- function(x) c(n = length(x), mean = mean(x), sd = sd(x), med = median(x))
dfm <- melt(mtcars, measure.vars = c("mpg", "hp", "wt"), id.vars = c("am", "cyl"))
cast(dfm, am + cyl + variable ~ ., dstats)

Descriptives by Variable (1)
library(pastecs)
stat.desc(data.frame)

Descriptives by Variable (2)
library(fBasics)
basicStats(dataframe)
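A short sketch of both, assuming pastecs and fBasics are installed:

```r
library(pastecs)
round(stat.desc(mtcars[, c("mpg", "hp", "wt")]), 2)   # n, min, max, mean, SE, CI, var, sd, ...

library(fBasics)
basicStats(mtcars$mpg)   # adds skewness and kurtosis to the usual descriptives
```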


Correlation Matrices and Plots
Note: use round(cor(x), 2) to round the correlation table to 2 decimals.
Correlation package I've created:
source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Regression/Correlation Matrix and Data Visualization.txt") cmat()

Select Numeric Columns for the cor function Useful for cor()
Method 1: sapply(dataframe, is.numeric)
Method 2: which(sapply(dataframe, is.numeric))
EXAMPLE with cor():
cor(iris)                                               # the problem: Species is not numeric
cor(iris[as.numeric(which(sapply(iris, is.numeric)))])  # the fix

Correlation and Correlation Tables
Type: cor(x, y, use = "complete.obs")

Where x and y are single numeric variables, the correlation is a single value. To correlate more than two numeric variables: 1) bind your outcome variables: y <- cbind(x, y, z) (or see selecting numeric variables above); 2) then type cor(y, use = "complete.obs"). The output will be a correlation table. Note: if you have changed all the variables to numeric as described in the changing-variables section, you can simply type cor(data1), where data1 is the data set; however, some of those numeric conversions (as in age level y, m, o = 1, 2, 3) produce correlations that are inappropriate.
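The steps above, run on mtcars (base R only):

```r
# a single pair of numeric variables gives a single value
with(mtcars, cor(mpg, wt, use = "complete.obs"))

# bind several outcome variables, then correlate
y <- with(mtcars, cbind(mpg, wt, hp))
round(cor(y, use = "complete.obs"), 2)   # correlation table
```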


Correlation Tables and p-values
Use rcorr() from the Hmisc package to get a correlation matrix and a p-value matrix.
Correlation matrix w/ ns and p-values (does pairwise deletion):
library(Hmisc)
rcorr(x, y, type = c("pearson", "spearman"))

To force complete-case (listwise) deletion instead:
rcorr(na.omit(x), y, type = c("pearson", "spearman"))

Pairwise Associations between Items using a Correlation Coefficient (similar to rcorr above)
library(ltm)
rcor.test(mat, p.adjust = FALSE, p.adjust.method = "holm", ...)
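A minimal rcorr() sketch, assuming Hmisc is installed (rcorr wants a matrix, not a data frame):

```r
library(Hmisc)
out <- rcorr(as.matrix(mtcars[, c("mpg", "wt", "hp")]), type = "pearson")
out$r   # correlation matrix
out$n   # pairwise ns
out$P   # p-value matrix
```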


Correlation matrix w/ ns and p-values
library(psych)
corr.test(x, y = NULL, use = "pairwise", method = "pearson")
Arguments:
x: a matrix or dataframe
y: a second matrix or dataframe with the same number of rows as x
use: use = "pairwise" is the default and does pairwise deletion of cases; use = "complete" selects just complete cases
method: method = "pearson" is the default; the alternatives passed to cor are "spearman" and "kendall"
Example library(psych) corr.test(mtcars)

Correlation Matrix w/sig stars sigstarC(dataset)


source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Regression/Correlation Matrix With Sig Stars.txt")

Find the significance of the difference between (un)paired correlations
library(psych)
paired.r(xy, xz, yz = NULL, n, n2 = NULL, twotailed = TRUE)
Arguments:
xy: r(xy)
xz: r(xz)
yz: r(yz)
n: number of subjects for the first group
n2: number of subjects in the second group (if not equal to n)
twotailed: calculate two- or one-tailed probability values
Description: Test the difference between two (paired or unpaired) correlations. Given 3 variables x, y, z, is the correlation between xy different from that between xz? If y and z are independent, this is a simple t-test of the z-transformed rs. But if they are dependent, it is a bit more complicated. To find the z of the difference between two independent correlations, first convert them to z scores using the Fisher r-z transform, then find the z of the difference between the two correlations. The default assumption is that the group sizes are the same, but the test can be done for different-size groups by specifying n2. If the correlations are not independent (i.e., they are from the same sample), then the correlation with the third variable r(yz) must be specified, and a t statistic is found for the difference of the two dependent correlations.
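A sketch with made-up correlation values (not real data), assuming the psych package is loaded:

```r
library(psych)
# independent correlations: groups of n = 100 and n2 = 80
paired.r(xy = .5, xz = .3, n = 100, n2 = 80)
# dependent correlations from one sample of n = 100; r(yz) must be supplied
paired.r(xy = .5, xz = .3, yz = .4, n = 100)
```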

Correlation matrix, correlation plots, and histograms of the variables library(psych) pairs.panels(x, smooth = TRUE, scale = FALSE, density=TRUE,ellipses=TRUE,digits=2)

scatterplot matrix

Arguments:
x: a data.frame or matrix
smooth: TRUE draws loess smooths
scale: TRUE scales the correlation font by the size of the absolute correlation
density: TRUE shows the density plots as well as histograms
ellipses: TRUE draws correlation ellipses
lm: plot the linear fit rather than the LOESS smoothed fits
digits: the number of digits to show
pch: the plot character (defaults to 20, a small dot)
cor: if plotting regressions, should correlations be reported?
jiggle: should the points be jittered before plotting?
factor: factor for jittering (1-5)
hist.col: what color the histogram on the diagonal should be
show.points: if FALSE, do not show the data points
Description: Adapted from the help page for pairs, pairs.panels shows a scatter plot matrix (SPLOM), with bivariate scatter plots below the diagonal, histograms on the diagonal, and the Pearson correlation above the diagonal. Useful for descriptive statistics of small data sets. If lm = TRUE, linear regression fits are shown for both y by x and x by y. Correlation ellipses are also shown. Points may be given different colors depending upon some grouping variable.
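A minimal sketch on the four numeric iris columns, assuming psych is installed:

```r
library(psych)
pairs.panels(iris[1:4],
             smooth = TRUE,     # loess fits in the lower panels
             density = TRUE,    # density curves over the diagonal histograms
             ellipses = TRUE)   # correlation ellipses
```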


Correlation plot represented by colors library(psych) cor.plot(dfg,colors=TRUE, n=10,main=NULL,zlim=c(-1,1),show.legend=TRUE,labels=NULL)

Correlation Test With p-values (pairwise or complete obs.)
library(psych)
corr.test(x, y = NULL, use = "pairwise", method = "pearson")
Description: Although the cor function finds the correlations for a matrix, it does not report probability values. corr.test uses cor to find the correlations for either complete or pairwise data and reports the sample sizes and probability values as well.
Arguments:
x: a matrix or dataframe
y: a second matrix or dataframe with the same number of rows as x
use: use = "pairwise" is the default and does pairwise deletion of cases; use = "complete" selects just complete cases
method: method = "pearson" is the default; the alternatives passed to cor are "spearman" and "kendall"
Details: corr.test uses the cor function to find the correlations, and then applies a t-test to the individual correlations.
Value:
r: the matrix of correlations
n: number of cases per correlation
t: value of the t-test for each correlation
p: two-tailed probability of t for each correlation


Correlation Matrix Plot Using Pie Graphs and Colors

library(corrgram)

corrgram(x, order = NULL, panel = panel.txt, lower.panel = panel.shade, upper.panel = panel.pie, diag.panel = NULL, text.panel = panel.txt, label.pos = 0.5, cex.labels = NULL, font.labels = 1, row1attop = TRUE, gap = 0, main = NULL)
Options:
x is a dataframe with one observation per row.

order = TRUE will cause the variables to be ordered using principal component analysis of the correlation matrix. panel= refers to the off-diagonal panels. You can use lower.panel= and upper.panel= to choose different options below and above the main diagonal respectively. text.panel= and diag.panel= refer to the main diagonal. Allowable parameters are given below.

Off-diagonal panels:
panel.pie (the filled portion of the pie indicates the magnitude of the correlation)
panel.shade (the depth of the shading indicates the magnitude of the correlation; lines on the shade indicate direction)
panel.ellipse (confidence ellipse and smoothed line)
panel.pts (scatterplot)
Main diagonal panels:
panel.minmax (min and max values of the variable)
panel.txt (variable name)
Use this function before plotting to change the colors used:
col.corrgram <- function(ncol) {
  colorRampPalette(c("purple", "red", "salmon", "pink"))(ncol)   # "lightred" is not a valid R color name; "salmon" substituted
}
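A minimal corrgram() sketch on a few mtcars columns, assuming the corrgram package is installed:

```r
library(corrgram)
corrgram(mtcars[, c("mpg", "disp", "hp", "wt", "gear")],
         order = TRUE,                 # order variables by PCA of the correlation matrix
         lower.panel = panel.shade,
         upper.panel = panel.pie,
         text.panel = panel.txt,
         main = "Corrgram of mtcars")
```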



Correlation Hypothesis Testing
Correlation package I've created for testing correlation hypotheses:
source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Regression/Correlation Matrix and Data Visualization.txt") cmat()

Confidence Interval for a Correlation Coefficient CIr(r, n, level = 0.95)

library(psychometric)

Arguments:
r: correlation coefficient
n: sample size
level: significance level for constructing the CI; default is .95

Convert r values to z scores:
fisherz <- function(rho) { 0.5 * log((1 + rho) / (1 - rho)) }   # converts r to z
fisherz(r)
OR

using library(psych):
fisherz(rho)                               # r to z
fisherz2r(z)                               # z to r
r.con(rho, n, p = .95, twotailed = TRUE)   # confidence interval for r
r2t(rho, n)                                # r to t

Description: convert a correlation to a z score (or z to r) using the Fisher transformation, or find the confidence intervals for a specified correlation.

Convert a Pearson correlation coefficient to Fisher's z:
library(psychometric)
r2z(x)   # where x is the Pearson correlation coefficient
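The hand-rolled fisherz() above is the same transform as base R's atanh(); a quick numeric check:

```r
fisherz <- function(rho) { 0.5 * log((1 + rho) / (1 - rho)) }
fisherz(0.5)                           # 0.5493061, identical to atanh(0.5)
all.equal(fisherz(0.9), atanh(0.9))    # TRUE
```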


Confidence Interval for Fisher z CIz(z, n, level = 0.95)

library(psychometric)

Arguments
z Fisher's z
n Sample size
level Confidence level for constructing the CI; default is .95
Convert a Fisher's z to a Pearson correlation coefficient
z2r(x)
Where x is the Fisher's z
library(psychometric)


Find partial correlation between two variables, with other variables removed
library(ggm)
pcor(u, S)
u

a vector of integers of length > 1. The first two integers are the indices of variables the correlation of which must be computed. The rest of the vector is the conditioning set. S a symmetric positive definite matrix, a sample covariance matrix.
EXAMPLE library(ggm) data(marks) with(marks, cor(vectors, algebra)) #cor not accounting for anything else ## The correlation between vectors and algebra given analysis and statistics pcor(c("vectors", "algebra", "analysis", "statistics"), var(marks))

Find the partial correlations for a set (x) of variables with set (y) removed partial.r(m, x, y) Arguments m A data or correlation matrix x The variable numbers associated with the X set. y The variable numbers associated with the Y set

library(psych)

Description A straightforward application of matrix algebra to remove the effect of the variables in the y set from the x set. Input may be either a data matrix or a correlation matrix. Variables in x and y are specified by location. It is sometimes convenient to partial the effect of a number of variables (e.g., sex, age, education) out of the correlations of another set of variables. This could be done laboriously by finding the residuals of various multiple correlations, and then correlating these residuals. The matrix algebra alternative is to do it directly. EXAMPLE cool <- make.hierarchical() #make up a correlation matrix round(cool[1:5,1:5],2) partial.r(cool,c(1,3,5),c(2,4))
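The matrix-algebra result can be cross-checked in base R by correlating residuals directly, as the Description suggests (a sketch on the built-in mtcars data; pcor2 is a local helper, not a package function):

```r
# Partial correlation of x and y controlling for z:
# correlate the residuals of x ~ z with the residuals of y ~ z
pcor2 <- function(x, y, z) {
  cor(resid(lm(x ~ z)), resid(lm(y ~ z)))
}

with(mtcars, cor(mpg, hp))        # zero-order correlation
with(mtcars, pcor2(mpg, hp, wt))  # partial r of mpg and hp, holding wt constant
```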

Find the partial correlations for a set correlation or covariance matrix


cor2pcor(m, tol)

library(corpcor)


Tests the hypothesis that two correlations are significantly different

library(psychometric)

rdif.nul(r1, r2, n1, n2) (one-tailed; p must be doubled to get the p value for a two-tailed test)
Arguments
r1 Correlation 1
r2 Correlation 2
n1 Sample size for r1
n2 Sample size for r2
Details First converts r to z for each correlation, then constructs a z test for the difference: z <- (z1 - z2)/sqrt(1/(n1-3) + 1/(n2-3))
Returns a table with 2 elements: zDIF z value for the H0; p p value
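The test in the Details section is two lines of base R (a sketch; rdif_z is a local name, not the psychometric function):

```r
# z test for the difference between two independent correlations
rdif_z <- function(r1, r2, n1, n2) {
  z1 <- atanh(r1)                   # Fisher r-to-z for each correlation
  z2 <- atanh(r2)
  z  <- (z1 - z2) / sqrt(1/(n1 - 3) + 1/(n2 - 3))
  c(zDIF = z, p = pnorm(-abs(z)))   # one-tailed p; double it for two-tailed
}

rdif_z(0.6, 0.4, n1 = 80, n2 = 80)  # z about 1.67, one-tailed p about .047
```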


Tests of significance for correlations

library(psych)

r.test(n, r12, r34 = NULL, r23 = NULL, r13 = NULL, r14 = NULL, r24 = NULL, n2 = NULL, pooled = TRUE, twotailed = TRUE)
Arguments
n Sample size of first group
r12 Correlation to be tested
r34 Test if this correlation is different from r12; if r23 is specified but r13 is not, then r34 becomes r13
r23 if ra = r(12) and rb = r(13), then test for differences of dependent correlations given r23
r13 implies ra = r(12) and rb = r(34): test for difference of dependent correlations
r14 implies ra = r(12) and rb = r(34)
r24 ra = r(12) and rb = r(34)
n2 n2 is specified in the case of two independent correlations; n2 defaults to n if not specified
pooled use pooled estimates of correlations
twotailed should a two-tailed or one-tailed test be used
Description Tests the significance of a single correlation, the difference between two independent correlations, the difference between two dependent correlations sharing one variable (Williams's Test), or the difference between two dependent correlations with different variables (Steiger Tests).

Details Depending upon the input, one of four different tests of correlations is done. 1. For a sample size n, find the t value for a single correlation. 2. For sample sizes of n and n2 (n2 = n if not specified) find the z of the difference between the z transformed correlations divided by the standard error of the difference of two z scores. 3. For sample size n, and correlations ra= r12, rb= r23 and r13 specified, test for the difference of two dependent correlations. 4. For sample size n, test for the difference between two dependent correlations involving different variables. For clarity, correlations may be specified by value. If specified by location and if doing the test of dependent correlations, if three correlations are specified, they are assumed to be in the order r12, r13, r23. Value test Label of test done z z value for tests 2 or 4 t t value for tests 1 and 3 p probability value of z or t


Nil hypothesis for a correlation (Does r = 0?) r.nil(r, n) Arguments r Correlation coefficient n Sample Size Performs a one-tailed t-test of the H0 that r = 0 Returns a table with 4 elements H0:rNot0 correlation to be tested t t value for the H0 df degrees of freedom p p value

library(psychometric)
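The statistic is a one-liner in base R and can be checked against cor.test (a sketch; r_nil_t is a local helper; note this version reports a two-tailed p, while r.nil is one-tailed):

```r
# t test of H0: rho = 0, with t = r * sqrt((n - 2) / (1 - r^2)) on n - 2 df
r_nil_t <- function(r, n) {
  t <- r * sqrt((n - 2) / (1 - r^2))
  c(t = t, df = n - 2, p = 2 * pt(-abs(t), n - 2))  # two-tailed p
}

r_nil_t(cor(mtcars$mpg, mtcars$wt), nrow(mtcars))
cor.test(mtcars$mpg, mtcars$wt)$statistic  # same t value
```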


Cronbach's Alpha
library(psych)
alpha(x, keys=NULL, cumulative=FALSE, title=NULL, max=10, na.rm=TRUE)

Cronbach's Alpha
library(psy)
cronbach(v1)
v1 = n*p matrix or dataframe, n subjects and p items. Missing values are omitted in a "listwise" way (all items are removed even if only one of them is missing).
Cronbach's Alpha
library(psychometric)
alpha(x)
Where x is a data.frame


Cronbach's Alpha
library(ltm)
cronbach.alpha(data, standardized = FALSE, CI = FALSE, probs = c(0.025, 0.975), B = 1000, na.rm = FALSE)
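All of these packages compute the same raw statistic, which is short enough to sketch in base R as a cross-check (alpha_raw is a local helper, not any package's function):

```r
# Raw Cronbach's alpha: (k/(k-1)) * (1 - sum of item variances / variance of totals)
alpha_raw <- function(x) {
  x <- as.matrix(x)
  k <- ncol(x)
  (k / (k - 1)) * (1 - sum(apply(x, 2, var)) / var(rowSums(x)))
}

set.seed(1)
items <- matrix(sample(1:5, 200, replace = TRUE), ncol = 4)  # fake 4-item scale
alpha_raw(items)
```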

Confidence Interval for Coefficient Alpha (1 or 2 tailed) First calculate an alpha and then: alpha.CI(alpha, k, N, level = 0.90, onesided = FALSE)

library(psychometric)


Descriptive Statistics for a response data frame (includes Cron. Alpha and lots of goodies) descript(data, n.print = 10, chi.squared = TRUE, B = 1000) library(ltm)

Returns:



Alternative reliability Analysis guttman(r,key=NULL) tenberge(r) glb(r,key=NULL) glb.fa(r,key=NULL)

library(psych)

?guttman

Arguments r A correlation matrix or raw data matrix. key a vector of -1, 0, 1 to select or reverse items

Estimation of a True Score Est.true(obs, mx, rxx)

library(psychometric)

Arguments
obs an observed score on test x
mx mean of test x
rxx reliability of test x
Description Given the mean and reliability of a test, this function estimates the true score based on an observed score. The estimate accounts for regression toward the mean.
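The estimate itself is one line (a base-R sketch; est_true is a local name, not the psychometric function):

```r
# Estimated true score: mx + rxx * (obs - mx); shrinks the observed score
# toward the test mean in proportion to the unreliability of the test
est_true <- function(obs, mx, rxx) mx + rxx * (obs - mx)

est_true(obs = 130, mx = 100, rxx = 0.9)  # 127
est_true(obs = 130, mx = 100, rxx = 0.5)  # 115: less reliable, more shrinkage
```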

Spearman-Brown Prophecy Formulae SBrel(Nlength, rxx) SBlength(rxxp, rxx)

library(psychometric)

Arguments
Nlength New length of a test in relation to the original
rxx reliability of test x
rxxp reliability of the desired (parallel) test x
Returns: rxxp, the prophesied reliability; N, the ratio of new test length to original test length
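Both directions of the prophecy formula are one-liners in base R (local helper names, not the psychometric functions):

```r
# Reliability after changing test length by factor N:
#   rxxp = N * rxx / (1 + (N - 1) * rxx)
sb_rel <- function(N, rxx) N * rxx / (1 + (N - 1) * rxx)

# Length factor needed to reach a target reliability rxxp
sb_length <- function(rxxp, rxx) rxxp * (1 - rxx) / (rxx * (1 - rxxp))

sb_rel(2, 0.7)       # doubling a .70 test gives about .82
sb_length(0.9, 0.7)  # about 3.9 times the items needed to reach .90
```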


Item Analysis (Gives lots of info from a samples responses) item.exam(x, y = NULL, discrim = FALSE)

library(psychometric)

Arguments
x matrix or data.frame of items
y Criterion variable
discrim Whether or not the discrimination of the item is to be computed
Description Conducts an item-level analysis. Provides item-total correlations, standard deviation in items, difficulty, discrimination, and reliability and validity indices.
Details If someone is interested in examining the items of a dataset contained in data.frame x, and the criterion measure is also in data.frame x, one must parse the matrix or data.frame and specify each part into the function. See example below. Otherwise, one must be sure that x and y are properly merged/matched. If one is not interested in assessing item-criterion relationships, simply leave out that portion of the call. The function does not check whether the items are dichotomously coded, this is user specified. As such, one can specify that items are binary when in fact they are not. This has the effect of computing the discrimination index for continuously coded variables. The difficulty index (p) is simply the mean of the item. When dichotomously coded, p reflects the proportion endorsing the item. However, when continuously coded, p has a different interpretation.


Grade multiple-choice responses (uses a multiple-choice data set and an answer key to score each response correct (1) or incorrect (0))
mult.choice(data, correct)

library(ltm)

Arguments
data a matrix or a data.frame containing the manifest variables as columns.
correct a vector of length ncol(data) with the correct responses (answer key)
The resulting 0/1 matrix can then be used for column sums; row sums; weighting of questions; grades; weighted grades based on question weights.

Find Intraclass Correlations (ICC1, ICC2, ICC3 from Shrout and Fleiss) of two raters (numeric)
ICC(x, missing=TRUE, alpha=.05)

library(psych)

Arguments
x a matrix or dataframe of ratings
missing if TRUE, remove missing data and work on complete cases only
alpha the alpha level for significance for finding the confidence intervals

Description The Intraclass correlation is used as a measure of association when studying the reliability of raters. Shrout and Fleiss (1979) outline 6 different estimates, that depend upon the particular experimental design. All are implemented and given confidence limits.

Intraclass correlation coefficient (ICC) icc(data)

library(psy)

data = n*p matrix or dataframe, n subjects p raters
Details Missing data are omitted in a listwise way. The "agreement" ICC is the ratio of the subject variance to the sum of the subject variance, the rater variance, and the residual; it is generally preferred. The "consistency" version is the ratio of the subject variance to the sum of the subject variance and the residual; it may be of interest when estimating the reliability of pre/post variations in measurements.


Find Cohen's kappa and weighted kappa coefficients for correlation of two raters (nominal)
cohen.kappa(x, w=NULL, n.obs=NULL, alpha=.05)
wkappa(x, w = NULL)

library(psych)

Arguments
x Either a two by n data set with categorical values from 1 to p, or a p x p table. If a data array is given, a table will be found.
w A p x p matrix of weights. If not specified, they are set to 0 on the diagonal and (distance from the diagonal)^2 off the diagonal.
n.obs Number of observations (if the input is a square matrix).
alpha Probability level for confidence intervals
Description Cohen's kappa (Cohen, 1960) and weighted kappa (Cohen, 1968) may be used to find the agreement of two raters when using nominal scores. weighted.kappa is (probability of observed matches - probability of expected matches)/(1 - probability of expected matches). Kappa just considers the matches on the main diagonal; weighted kappa considers off-diagonal elements as well.

Find Cohen's kappa and weighted kappa coefficients for correlation of two raters (nominal)
wkappa(r, weights="squared")

library(psy)

Arguments
r n*2 matrix or dataframe, n subjects and 2 raters
weights weights="squared" to obtain squared weights. If not, absolute weights are computed
Details The diagnoses are ordered as follows: numbers < letters, with letters and numbers ordered naturally. For weights="squared", weights are related to squared differences between row and column indices (in this situation wkappa is close to an icc). For weights!="squared", weights are related to absolute values of differences between row and column indices. The function deals with the case where the two raters do not have exactly the same scope of rating (some software reports an error in this situation). Missing values are omitted.


Reverse Scoring

library(psych)

reverse.code(keys, items, mini = NULL, maxi = NULL)

EXAMPLE original <- matrix(sample(6,50,replace=TRUE),10,5) keys <- c(1,1,-1,-1,1) #reverse the 3rd and 4th items new <- reverse.code(keys,original,mini=rep(1,5),maxi=rep(6,5)) NOTE: Reverse scoring can also be accomplished by taking the item and creating a new rescored variable using the formula: (m+1)-s = reverse scored item Where m is the max score you could have gotten on a Likert type scale and s is the score vector containing the scores of the item that is to be reverse scored.
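The (m+1)-s formula in the note is easy to verify in base R:

```r
# Reverse-score a 1-to-6 Likert item with (max + 1) - score
s <- c(1, 2, 6, 4, 3)
(6 + 1) - s  # 6 5 1 3 4
# For scales that do not start at 1, use (min + max) - score instead
```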


TABULAR DATA
Table of Counts (frequency table) [nested]
table(factor1, factor2, ..., factorN)
ftable(factor1, factor2, ..., factorN)
Note: you can also describe the table with a formula using the tilde ~ (see example)
EXAMPLE DF<-read.table("fake remdial reading (logistic regression example).csv", header=TRUE, sep=",",na.strings="999") DF<-data.frame(DF,"Fav.Color"=sample(c("blue","red","green","orange"),nrow(DF),replace=T)) with(DF,table(sex,rem.read.rec)) with(DF,table(sex,rem.read.rec,Fav.Color)) #too cumbersome so we use ftable with(DF,ftable(sex,rem.read.rec,Fav.Color)) with(DF,ftable(rem.read.rec~sex+Fav.Color)) with(DF,ftable(sex+Fav.Color~rem.read.rec))

Un-nested table of counts
margin.table(table, number of the factor to reveal)


EXAMPLE: DF<-read.table("fake remdial reading (logistic regression example).csv", header=TRUE, sep=",",na.strings="999") DF<-data.frame(DF,"Fav.Color"=sample(c("blue","red","green","orange"),nrow(DF),replace=T)) (tab2<-with(DF,table(sex,Fav.Color,rem.read.rec))) margin.table(tab2,1) margin.table(tab2,2) margin.table(tab2,3)
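The same calls work on any built-in data, which makes a self-contained check easy (a sketch using mtcars, which ships with R):

```r
# Three-way nested count table from built-in data
tab <- with(mtcars, table(cyl, gear, am))

ftable(tab)           # flat, readable layout of the nested table
margin.table(tab, 1)  # collapse down to counts of cyl alone
margin.table(tab, 3)  # counts of am alone
```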

Compute column and row sums for a table method 1 addmargins(table)

Compute column and row sums for a table method 2 mar_table(x)

library(vcd)

d<-data.frame(matrix(c(sample(c("red","blue", "green"), 25, replace=T), sample(c(letters[1:5]), 25, replace=T), sample(c("DOG","CAT", "CHICKEN", "SNAKE"), 25, replace=T)), nrow=25, ncol=3)) DT <- with(d, table(X1,X3)) with(d, chisq.test(X1,X2)) with(d, fisher.test(X1,X2)) DT<-with(d,xtabs(~X1+X3)) addmargins(DT)

mar_table(DT)


Table of Counts for Proportion Tables prop.table(table)


EXAMPLE DF<-read.table("fake remdial reading (logistic regression example).csv", header=TRUE, sep=",",na.strings="999") DF<-data.frame(DF,"Fav.Color"=sample(c("blue","red","green","orange"),nrow(DF),replace=T)) with(DF,table(sex,rem.read.rec)) with(DF,table(sex,rem.read.rec,Fav.Color)) with(DF,ftable(sex,rem.read.rec,Fav.Color)) with(DF,ftable(rem.read.rec~sex+Fav.Color)) (tab1<-with(DF,ftable(sex+Fav.Color~rem.read.rec))) prop.table(tab1,1) (percentTABLE<-prop.table(tab1,1)*100) (tab2<-with(DF,table(sex,Fav.Color,rem.read.rec))) prop.table(tab2,1) (percentTABLE<-prop.table(tab2,1)*100)
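What the second (margin) argument does can be checked on built-in data in a few lines:

```r
tab <- with(mtcars, table(cyl, am))

prop.table(tab)     # each cell over the grand total; all cells sum to 1
prop.table(tab, 1)  # row proportions; each row sums to 1
prop.table(tab, 2)  # column proportions; each column sums to 1
```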

Tabular Data 2 x 2 Table

Chi squared test of independence 2 x 2

summary(table(factor1,factor2))
EXAMPLE DF<-data.frame(cbind("X1"=c(rep("yes",12),rep("no",12)),"X2"=c(rep("red",10),rep("blue",14)))) with(DF,summary(table(X1,X2)))

Tabular Data 2 x 2 Table with Yates Continuity Correction Chi squared test of independence chisq.test(table(factor1,factor2))
EXAMPLE DF<-data.frame(cbind("X1"=c(rep("yes",12),rep("no",12)),"X2"=c(rep("red",10),rep("blue",14)))) with(DF, chisq.test(table(X1,X2)))

Compute a table of expected frequencies (used by chisq.test) independence_table(x, frequency = c("absolute", "relative"))

library(vcd)

x is a table. frequency indicates whether absolute or relative frequencies should be computed.


EXAMPLE x<-ftable(mtcars[,c(2,8)]) independence_table(x)
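The absolute expected counts are plain base-R arithmetic, so vcd is not strictly needed for them; a sketch checked against chisq.test:

```r
tab <- with(mtcars, table(cyl, am))

# Expected cell count under independence: row total * column total / n
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
expected  # same matrix as chisq.test(tab)$expected
```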

Tabular Data 2 x 2 and larger Table fisher.test(table(factor1,factor2))


EXAMPLE with(warpbreaks,fisher.test(table(wool,tension)))


Cross Tabulation with Tests for Factor Independence library(gmodels)


CrossTable(x, y, digits=3, max.width = 5, expected=FALSE, prop.r=TRUE, prop.c=TRUE, prop.t=TRUE, prop.chisq=TRUE, chisq = FALSE, fisher=FALSE, mcnemar=FALSE, resid=FALSE, sresid=FALSE, asresid=FALSE, missing.include=FALSE, format=c("SAS","SPSS"), dnn = NULL, ...)
Arguments
x A vector or a matrix. If y is specified, x must be a vector y A vector in a matrix or a dataframe digits Number of digits after the decimal point for cell proportions max.width In the case of a 1 x n table, the default will be to print the output horizontally. If the number of columns exceeds max.width, the table will be wrapped for each successive increment of max.width columns. If you want a single column vertical table, set max.width to 1 expected If TRUE, chisq will be set to TRUE and expected cell counts from the Chi-Square will be included prop.r If TRUE, row proportions will be included prop.c If TRUE, column proportions will be included prop.t If TRUE, table proportions will be included prop.chisq If TRUE, chi-square contribution of each cell will be included chisq If TRUE, the results of a chi-square test will be included fisher If TRUE, the results of a Fisher Exact test will be included mcnemar If TRUE, the results of a McNemar test will be included resid If TRUE, residual (Pearson) will be included sresid If TRUE, standardized residual will be included asresid If TRUE, adjusted standardized residual will be included missing.include If TRUE, then remove any unused factor levels format Either SAS (default) or SPSS, depending on the type of output desired. dnn the names to be given to the dimensions in the result (the dimnames names).

EXAMPLES
library(gmodels)
CrossTable(infert$education, infert$induced, expected = TRUE, chisq = T, fisher = T,
           mcnemar = T, resid = T, sresid = T, asresid = T, missing.include = T)
#&&&&&&&&&&&&&&&&&&&&&
CrossTable(mtcars$cyl, mtcars$vs, format = "SAS")
CrossTable(mtcars$cyl, mtcars$vs, format = "SPSS")

Combine columns or rows of a cross table library(vcdExtra) collapse.table(table)


#EXAMPLE
library(vcdExtra)
# create some sample data in table form
sex <- c("Male", "Female")
age <- letters[1:6]
education <- c("low", "med", "high")
data <- expand.grid(sex = sex, age = age, education = education)
counts <- rpois(36, 100)
data <- cbind(data, counts)
(t1 <- xtabs(counts ~ sex + age + education, data = data))
# collapse age to 3 levels
(t2 <- collapse.table(t1, age = c("A", "A", "B", "B", "C", "C")))
# collapse age to 3 levels and pool education: "low" and "med" to "low"
(t3 <- collapse.table(t1, age = c("A", "A", "B", "B", "C", "C"),
                      education = c("low", "low", "high")))
# change labels for levels of education to 1:3
(t4 <- collapse.table(t1, education = 1:3))


Strength of Effect Measures [SOE] (Tabular Data)
Read the "Measures of association in crosstab tables" article for SOE measure decisions
Compute Pearson chi-square, likelihood ratio chi-square, phi coefficient, contingency coefficient & Cramer's V
assocstats(x)
library(vcd)

EXAMPLES data("Arthritis") Arthritis (tab <- xtabs(~Improved + Treatment, data = Arthritis)) summary(assocstats(tab)) #AND x<-ftable(mtcars[,c(2,8)]) summary(assocstats(x))

Cohen's kappa and weighted kappa for a confusion matrix
Kappa(x)

library(vcd)

x is a square confusion matrix (a cross-classification of ratings from two raters).

Note: vcd's function is Kappa() with a capital K. Base R's kappa(z) is unrelated: it estimates the condition number of a matrix z (or of a qr result, or a fit from a class inheriting from "lm") and says nothing about rater agreement.

Examples (base kappa, the condition number)
kappa(x1 <- cbind(1,1:10))  # 15.71
kappa(x1, exact = TRUE)     # 13.68
kappa(x2 <- cbind(x1,2:11)) # high! [x2 is singular!]
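Unweighted Cohen's kappa itself is simple arithmetic on a confusion matrix, so it can also be sketched in base R (kappa_stat is a local helper, not a package function):

```r
# Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)
kappa_stat <- function(tab) {
  tab <- tab / sum(tab)                    # convert counts to proportions
  po  <- sum(diag(tab))                    # observed agreement (main diagonal)
  pe  <- sum(rowSums(tab) * colSums(tab))  # agreement expected by chance
  (po - pe) / (1 - pe)
}

ratings <- matrix(c(20, 5, 10, 15), 2, 2)  # 2 x 2 confusion matrix, two raters
kappa_stat(ratings)                        # 0.4
```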


Turn a table into a dataframe METHOD 1
table2flat(table)
table can be an ftable, xtabs, or table object
#EXAMPLE x <- with(mtcars,table(am, gear, cyl, vs)) table2flat(x) x2 <- with(mtcars,ftable(am, gear, cyl, vs)) table2flat(x2)

#CODE
table2flat <- function(mytable){  # by Robert Kabacoff
  df <- as.data.frame(mytable)
  rows <- dim(df)[1]
  cols <- dim(df)[2]
  x <- NULL
  for (i in 1:rows){
    # note: assumes every Freq is > 0; 1:0 would loop backwards for empty cells
    for (j in 1:df$Freq[i]){
      row <- df[i, c(1:(cols - 1))]
      x <- rbind(x, row)
    }
  }
  row.names(x) <- c(1:dim(x)[1])
  return(x)
}
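A vectorized base-R alternative does the same expansion without the double loop (a sketch relying only on as.data.frame and rep; table_to_raw is a local name):

```r
# Expand a frequency table back into one row per observation
table_to_raw <- function(tab) {
  df  <- as.data.frame(tab)  # factor columns plus a Freq column
  out <- df[rep(seq_len(nrow(df)), df$Freq), names(df) != "Freq", drop = FALSE]
  rownames(out) <- NULL
  out
}

raw <- table_to_raw(with(mtcars, table(am, gear)))
nrow(raw)  # 32, one row per original observation
```

Unlike the looped version, rep() with a times vector also handles cells with zero counts.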

Turn a table into a dataframe METHOD2

library(vcdExtra)

expand.dft(x, var.names = NULL, freq = "Freq", ...) expand.table(x, var.names = NULL, freq = "Freq", ...)

art <- xtabs(~Treatment + Improved, data = Arthritis) art expand.dft(art)

Categorical Article Data (Cross Table) to Raw Data


Below is an example starting with creating a table from numeric values (replicate data frame from results) FROM AN ARTICLE TO A TABLE TO RAW DATA
#===================================================================
# CREATE THE DATA FRAME FROM A MATRIX OF FREQUENCIES
#===================================================================
d2 <- matrix(c(23, 15, 66, 34, 19, 22), ncol=3, nrow=2)
dimnames(d2) <- list(Gender=c("boys", "girls"),
                     Inst.Meth=c("direct.int", "explicit.learn", "didactic"))
d2 <- as.table(d2)
d2
#===================================================================
expand.dft(d2)  # BOTH FUNCTIONS WILL RETURN THE DATA FRAME
table2flat(d2)


ANOVA
ANOVA (balanced or not; as many ways as you want [1 way, 2 way, 3 way ...]) linear model
Type: anova(lm(sc ~ g)) One way

Analysis of Variance Table
Response: sc
          Df Sum Sq Mean Sq F value  Pr(>F)
g          1 18.778 18.7778  6.7759 0.01359 *
Residuals 34 94.222  2.7712
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(lm(sc ~ s*a*g)) Multi-way (gives interactions)

Analysis of Variance Table
Response: sc
          Df Sum Sq Mean Sq F value  Pr(>F)
s          1  4.000  4.0000  1.8947 0.18138
a          2 23.167 11.5833  5.4868 0.01091 *
g          1 18.778 18.7778  8.8947 0.00647 **
s:a        2  1.167  0.5833  0.2763 0.76095
s:g        1 11.111 11.1111  5.2632 0.03083 *
a:g        2  1.389  0.6944  0.3289 0.72287
s:a:g      2  2.722  1.3611  0.6447 0.53365
Residuals 24 50.667  2.1111
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(lm(sc ~ s+a+g)) Multi-way (gives only main effects)

Analysis of Variance Table
Response: sc
          Df Sum Sq Mean Sq F value   Pr(>F)
s          1  4.000  4.0000  1.8492 0.183684
a          2 23.167 11.5833  5.3550 0.010055 *
g          1 18.778 18.7778  8.6810 0.006056 **
Residuals 31 67.056  2.1631
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note: sc is the dependent variable; s, a, g are the main effects (IVs); s:a, s:g, a:g, and s:a:g are the interactions.


Analysis of Variance AOV Model
Type: z.aov <- aov(y ~ factor1*factor2*factor3)
Where z.aov is the output, y is the DV numeric scores, and factor1, factor2, factor3 are your categorical IVs.
summary(z.aov)
This will give you the same output as summary(anova(lm(y ~ factor1*factor2*factor3))) from the linear model approach.
Means Tables (load car package)
This is for after you have run an aov model (make sure you've labeled the categorical IVs as factors using the as.factor function):
model.tables(z.aov, "means", se = T)
Where z.aov is the output label for the aov model you've just run. This gives you the means tables for main and interaction effects.
Residual plots
plot(model)
example: plot(hw.aov)

Where model is the aov model.
Post Hoc & Protected Tests (for use after ANOVA)
Tukey
TukeyHSD(model) [example: TukeyHSD(z.aov)]
Where model is the output label for the aov.
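A self-contained run on built-in data (warpbreaks ships with R):

```r
# One-way ANOVA followed by Tukey's HSD on the built-in warpbreaks data
fit <- aov(breaks ~ tension, data = warpbreaks)
summary(fit)   # omnibus F test across the three tension levels
TukeyHSD(fit)  # all pairwise differences with family-wise adjusted p values
```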


MANOVA
I will use the following data set to illustrate the MANOVA.
I usually go in and change the variable names to something simple using the command: s <- data$Study.Group
Then make sure your categorical variables are factors using the command: s <- as.factor(s)
The next step is to bind your outcome variables: y <- cbind(c, l, h)
The output (when entering y) should look like this

Now you can check for outliers using the aq.plot command from the mvoutlier package: aq.plot(y)
> aq.plot(y ) Projection to the first and second robust principal components. Proportion of total variation (explained variance): 0.8742298 $outliers [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE [15] TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE Warning message: In princomp.default(x, covmat = covr) : both 'x' and 'covmat' were supplied: 'x' will be ignored
[Figure: aq.plot output — four panels: the data projected onto the first two robust principal components, the cumulative probability of the ordered squared robust distances against the 97.5% and adjusted quantiles, outliers based on the 97.5% quantile, and outliers based on the adjusted quantile.]

Next you can run the MANOVA using the following command: v <- manova(lm(y ~ s * r))
The v is an arbitrary choice just as y was in the last step. The s*r will give you the main effects and the interaction as well. The output (when entering v) should look like this.
To get the F statistics you need to call up a summary command: summary(v)
By default the test will be the Pillai. If you wish to change the output to Wilks enter: summary(v, test="Wilks")
The output should look like this. Note: you can also change Wilks to Roy.
From here you should go on to running individual ANOVAs on each of the outcome variables to complete the ANOVA table.

MANOVA Correlation Table (for the DVs)
Type: cor(y)
Note: y is the bound numeric outcome variables (the cbind from above). The output will be a correlation table.


Repeated Measures Anova (Balanced Design)


(It is inappropriate to use this for unbalanced designs and/or missing values.)

A 3 time no between factors model (IV-Subject[categorical and random]; IV-Meal[categorical and fixed]; DV-Cholesterol Rating[numeric]) Data Set Type: ex20.aov <- aov(Cholesterol.Intake ~factor(Meal) + Error(factor(Student))) Where Cholesterol.Intake is the DV, Meal is an IV fixed factor, and Student is an IV random factor. [output]

A 3 time within and between factors model (IV-Subject [categorical and random]; IV-Gender [categorical and fixed]; IV-Instructor Type [categorical and fixed]; DV-Cholesterol Rating [numeric])

> ex27.10.aov <- aov(Study.Time ~factor(Instructor.Type)*factor(Gender) + Error(factor(Student))) And then > summary(ex27.10.aov) [output] Note: see section 27:12 for the ANOVA table that corresponds to this output.


Repeated Measures Anova (balanced or unbalanced) [relies on the car package]
A 3 time no between factors model (IV-Subject [categorical and random]; IV-Meal [categorical and fixed]; DV-Cholesterol Rating [numeric])
Note: this model keeps data sets in the traditional data table format (no need to re-format data); additionally the DV does not have an actual column (it is instead the numeric measurements at the meal variables)
Data Set
1) Create a vector of levels for the measurement points (1 for each measurement point):
meals <- c(1, 2, 3)
Where meals is the new vector name (a factor), and the numbers represent each measurement point.
2) Create a within-groups measurement point factor to house the levels you just created (this will be used later in our matrix-style data frame and in our Anova analysis):
mealFactor <- as.factor(meals)
Where mealFactor is the new factor with n levels to house our levels that describe our n numeric columns (measurement points).
3) Create a matrix-style data frame from the factor and levels that will be used to describe our numeric columns (measurement points):
mealFrame <- data.frame(mealFactor)
4) Now create a bound vector containing the n numeric columns for later use in the linear model:
mealBind <- cbind(breakfast, lunch, dinner)
5) Create a linear model with the bound vector you just created:
mealModel <- lm(mealBind ~ 1)
6) Use the Anova function from the car package to analyze our data (notice we are using the measurement-point matrix-style data frame and corresponding within-groups factor as well as the linear model we just created):
analysis3 <- Anova(mealModel, idata = mealFrame, idesign = ~mealFactor)
Note: we could have added the argument type="III", but the default of Anova is to switch from type II to type III SS when there is only one intercept.
7) Now create a summary of the anova tables and information:
summary(analysis3)
Look below at the summary:

Possible Errors for small DF Your tutorial is excellent. I was able to follow it easily and quickly analyze a data set I've been working with for a long time. I tried applying the same steps to another data set but when I tried to use the Anova(mod, idata, idesign) function I got the following error message: Error in linearHypothesis.mlm(mod, hyp.matrix, SSPE = SSPE, idata = idata, : The error SSP matrix is apparently of deficient rank = 3<4 Do you have any idea what this means or how to deal with it. Thanks a lot! John M. Quick said... Thanks for the comments. I am familiar with this error. In short, it has to do with a combination of a lack of degrees of freedom to execute the multivariate tests (i.e. small sample size compared to variables) and the inability of the Anova() function to ignore/forgo calculating the multivariate tests. See this R listserv discussion for details: http://r.789695.n4.nabble.com/Anova-in-car-SSPEapparently-deficient-rank-tp997619p997619.html

An alternative, which will get you the Greenhouse-Geisser and Huynh-Feldt epsilon corrections, but no multivariate tests, is to use the anova() function. anova(ageModel, idata = ageFrame, X = ~ageFactor, test = "Spherical") One caveat, I believe, is that this will use Type I SS, whereas my Anova() example uses Type III SS. I'm not sure how to get Type III SS with the anova() function.


A 3 time within and between factors model (IV-Subject [categorical and random]; IV-Gender [categorical and fixed]; IV-Instructor Type [categorical and fixed]; DV-Cholesterol Rating [numeric])
Note: this model keeps data sets in the traditional data table format (no need to re-format data); additionally the DV does not have an actual column (it is instead the numeric measurements at the instructor variables)
Data Set
1) Create a vector of levels for the measurement points (1 for each measurement point):
instructor <- c(1, 2, 3)
Where instructor is the new vector name (a factor), and the numbers represent each measurement point.
2) Create a within-groups measurement point factor to house the levels you just created (this will be used later in our matrix-style data frame and in our Anova analysis):
instructorF <- as.factor(instructor)
Where instructorF is the new factor with n levels to house our levels that describe our n numeric columns (measurement points).
3) Create a matrix-style data frame from the factor and levels that will be used to describe our numeric columns (measurement points):
instructorFR <- data.frame(instructorF)
4) Now create a bound vector containing the n numeric columns for later use in the linear model:
instructorBind <- cbind(male, female, computer)
5) Create a linear model with the bound vector you just created:
LMmodel <- lm(instructorBind ~ gender)
Notice we have included the fixed between-groups gender variable in the linear model.
6) Use the Anova function from the car package to analyze our data (notice we are using the measurement-point matrix-style data frame and corresponding within-groups factor as well as the linear model we just created):
analysis7 <- Anova(LMmodel, idata = instructorFR, idesign = ~instructorF)
7) Now create a summary of the anova tables and information:
summary(analysis7)
Look below at the summary and how the information was placed in an anova table:

>instructor<- c(1, 2, 3) > instructorF<- as.factor(instructor) > instructorFR<- data.frame(instructorF) > instructorBind<-cbind(male ,female ,computer) > LMmodel<-lm( instructorBind~gender ) > analysis4 <- Anova(LMmodel, idata = instructorFR, idesign = ~instructorF) > summary(analysis4)

Analysis of Variance Table

Source                 SS       df  MS     F
Grand Mean             2022.78   1
Gender (A)             3.56      1  3.56   1.21
Student within         47.07    16  2.94
Instructor Type (B)    92.29     2  46.14  22.58**
AB Interaction         37.16     2  18.58  9.09**
Student x Instructor   65.39    32  2.04
Total                  2668.25  54

**p<.01
Notes: 1. MS = SS/df. 2. To get totals, sum the columns.
.01 critical values: F(1,16) = 8.53, F(2,32) = 5.387


LINEAR MODELING
Linear Model
m1 <- lm(gl ~ sc, data = df)
summary(m1)
Note: m1 is changeable; gl and sc are the numeric variables.
Resistant Linear Model
lqs(gl ~ sc, data = df)
For models with outliers consider this model. Uses least median squares (LMS) & least trimmed squares (LTS).
Robust Linear Model
rlm(gl ~ sc, data = df)
For models with outliers and heteroscedasticity problems consider this model.
Graph comparing the regression lines of the three models
[Figure: scatterplot of mpg against wt for mtcars2 (mtcars plus two added outlier rows, mp800 and DFxz00), with the fitted lines from lm() (solid, blue), lqs() (dashed, red), and rlm() (dotted, green). Caption: "Notice how the outliers affect lm(), less with lqs(), and least with rlm()". lqs() and rlm() come from library(MASS).]
CODE FOR THE GRAPH ABOVE


mtcars2 <- data.frame(rbind(mtcars[, c(1, 6)],
    "mp800" = c(16.0, 9), "DFxz00" = c(12.0, 3.7)))
library(MASS)
mod  <- lm(mpg~wt, data=mtcars2)
mod2 <- lqs(mpg~wt, data=mtcars2)
mod3 <- rlm(mpg~wt, data=mtcars2)
plot(mod); par(ask=T); library(mvoutlier)
aq.plot(mtcars2[c("mpg","wt")]); par(ask=T)
uni.plot(mtcars2[c("mpg","wt")], symbol=T)
influence.measures(mod)
par(ask=T); par(mfrow=c(1,1))
with(mtcars2, plot(wt, mpg))
abline(reg=mod,  lty=1, col="blue")
abline(reg=mod2, lty=2, col="red")
abline(reg=mod3, lty=3, col="green")
legend(x=6.72, y=33.67, legend=c("lm()","lqs()","rlm()"),
    lty=c(1,2,3), col=c("blue","red","green"))
mtext("Notice how the outliers affect lm(), less with lqs(), and least with rlm()",
    font=4, side=3, col="dark green")


Calling Components of Linear Models and Summaries
Use model$ and one of the components below [use names(model) to view these]:

# [1] "coefficients"  "residuals"     "effects"       "rank"
# [5] "fitted.values" "assign"        "qr"            "df.residual"
# [9] "xlevels"       "call"          "terms"         "model"

Use summary(model)$ and one of the components below [use names(summary(model)) to view these]:


names(summary(fit))
# [1] "call"          "terms"         "residuals"     "coefficients"
# [5] "aliased"       "sigma"         "df"            "r.squared"
# [9] "adj.r.squared" "fstatistic"    "cov.unscaled"

Accessor Functions
residuals(model); resid(model)
coefficients(model); coef(model)
fitted(model)
predict(model)
deviance(model)
df.residual(model)
rstandard(model)
rstudent(model)
influence.measures(model)
influence(model)
dfbeta(model)
dfbetas(model)
covratio(model)
cooks.distance(model)
hatvalues(model)
BIC(model)
AIC(model)
model.frame(model)


Dealing with multicollinearity, method 1
lm.ridge()
From library(MASS); it attempts to minimize the residual sum of squares while penalizing for coefficient size (ridge regression).

Dealing with multicollinearity, method 2
lars(x, y, type = "lasso")
Where x is a matrix of predictor values and y is the response variable. From library(lars); it penalizes coefficient size differently than lm.ridge, using the least angle regression algorithm.

Dealing with multicollinearity, method 3
pcr(formula)
From library(pls); it transforms the predictors (principal components) and then linear regression is performed.

Dealing with multicollinearity, method 4
plsr(formula)
From library(pls); it uses partial least squares and then linear regression is performed.

Dealing with multicollinearity, method 5
Mean centering according to Aiken and West.

Linear Model Hypothesis Test For Simple Linear Regression
See: ex21c.docx

Regression Analysis
Use the source code below for calling Regression and Correlations Functions:
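Method 1 can be sketched as follows; the mtcars data and the lambda grid here are my own illustration, not from the original notes:

```r
library(MASS)
# Ridge regression: mtcars has correlated predictors (disp, wt, hp),
# which makes it a reasonable toy case for shrinkage
rmod <- lm.ridge(mpg ~ disp + wt + hp, data = mtcars,
                 lambda = seq(0, 20, 0.5))
select(rmod)  # prints the HKB, L-W, and smallest-GCV choices of lambda
plot(rmod)    # coefficient paths as lambda increases
```

Note that select() here is MASS's method for ridgelm objects; other packages define a select() generic of their own, so load order can matter.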
source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Regression/.Regression Bundle.txt")
rfun()
source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Regression/Correlation Matrix and Data Visualization.txt")
cmat()


Calculate Multiple Regression from a correlation matrix
library(psych)
mat.regress(m, x, y, n.obs = NULL)   #Multiple Regression from matrix input

Arguments
m: a matrix of correlations or, if not square, of data
x: either the column numbers of the x set (e.g., c(1,3,5)) or the column names of the x set (e.g., c("Cubes","PaperFormBoard"))
y: either the column numbers of the y set (e.g., c(2,4,6)) or the column names of the y set (e.g., c("Flags","Addition"))
n.obs: if specified, then confidence intervals, etc. are calculated; not needed if raw data are given

Allows you to calculate multiple regression from a correlation matrix. Useful for interpreting results from someone else's work.
require(foreign)
data <- read.spss("data3_Revised.sav", use.value.labels = TRUE, to.data.frame = TRUE)
dat <- subset(data, select = c(achiev, momedu, ses, parsupport, parmonitor, rules))
dat3 <- subset(dat, select = c(achiev, parsupport, parmonitor, rules))
dat3 <- na.omit(dat3)
mod7 <- with(dat3, lm(achiev ~ parsupport + parmonitor + rules))
test.data <- cor(dat3)
x <- mat.regress(test.data, c(2,3,4), c(1), n.obs = 478) #choose the variables by number
summary(x, digits = 4) #note: gives standardized beta weights
#compare to:
summary(mod7); mod7
stan.beta(mod7)

Computes the confidence interval for a desired level for the squared multiple correlation
CI.Rsq(rsq, n, k, level = 0.95)
For this function you enter the R-squared value yourself; the function below takes a fitted model instead.


Computes the confidence interval for a desired level for the squared multiple correlation
library(psychometric)
CI.Rsqlm(obj, level = 0.95)

Where obj is the linear model (i.e., obj <- lm(y ~ x1 + x2)) and level is the confidence interval desired.

Arguments
R: Correlation Coefficient
n: Sample Size
level: Significance Level for constructing the CI, default is .95
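A minimal sketch of the first function, with made-up numbers (an R-squared of .30 from n = 100 cases and k = 3 predictors):

```r
library(psychometric)
# 95% CI around a squared multiple correlation entered by hand
CI.Rsq(rsq = .30, n = 100, k = 3, level = 0.95)
```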

Predicting from a model
predict(object, data)


#EXAMPLE
(mod <- lm(mpg ~ hp + disp + hp:disp, data = mtcars))
NEW <- data.frame(hp = c(260, 280), disp = c(330, 350))
predict(mod, NEW)


One Factor/One Continuous ANCOVA Example:


#============================================================================================================
#GETTING THE DATA
#============================================================================================================
regrowth <- read.table("ipomopsis.txt", header=TRUE, sep="\t", na.strings="999")
attach(regrowth)
names(regrowth)
head(regrowth)
#COVARIATE-->"Root" / OUTCOME-->"Fruit" / CATEGORICAL-->"Grazing"
#============================================================================================================
#LOOKING AT MEANS
#============================================================================================================
mean(subset(regrowth, Grazing=="Ungrazed")$Fruit)
mean(subset(regrowth, Grazing=="Grazed")$Fruit)
#............................
#Looking at the means alone suggests that the grazed plants produce more fruit
#(an incorrect conclusion, as the plot will show)
#............................
#============================================================================================================
#PLOTTING THE DATA
#============================================================================================================
plot(Root, Fruit, pch=16+as.numeric(Grazing), col=c("blue","green")[as.numeric(Grazing)])
#............................
#A look at the lines reveals ungrazed actually produces more fruit, the opposite of what the means suggest
#16+as.numeric() is what turns the categorical data into plot points [16 changes the point type]
#............................
abline(lm(Fruit[Grazing=="Grazed"]~Root[Grazing=="Grazed"]), lty=1, col="blue")
abline(lm(Fruit[Grazing=="Ungrazed"]~Root[Grazing=="Ungrazed"]), lty=3, col="dark green")
legend(locator(1), c("Grazed","Ungrazed"), fill=c("blue","dark green"))
#............................
#draws the regression lines for each group of Grazing as described by the covariate Root
#............................
#============================================================================================================
#ANALYZING THE DATA (ANCOVA)
#============================================================================================================
ancova.fruit <- lm(Fruit ~ Grazing*Root)
#............................
#covariates go second because we are not interested in their effects, just the additional error
#they remove and the power they give
#order matters here: anova(lm(Fruit~Root*Grazing)) will give a different output
#............................
summary(ancova.fruit)
anova(ancova.fruit)


Two Factor/One Continuous ANCOVA Example:


#============================================================================================================
#GETTING THE DATA
#============================================================================================================
Gain <- read.table("Gain.txt", header=TRUE, sep="\t", na.strings="999")
attach(Gain)
names(Gain)
head(Gain)
#COVARIATE-->"Age" / OUTCOME-->"Weight" / CATEGORICAL-->"Sex" / CATEGORICAL-->"Genotype"
#============================================================================================================
#LOOKING AT MEANS
#============================================================================================================
#............................
#method 1
#............................
library(doBy)
summaryBy(Weight ~ Sex + Genotype, data = Gain,
    FUN = function(x) { c(n = length(x), mean = mean(x), sd = sd(x)) })
#............................
#method 2
#............................
source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Regression/.Regression Bundle.txt")
rfun()
desc2v(Gain, Sex, Genotype)
#============================================================================================================
#ANALYZING THE DATA (ANCOVA)
#============================================================================================================
ancova.gain <- lm(Weight ~ Sex*Age*Genotype)
summary(ancova.gain)
anova(ancova.gain)


Basic Functions

Generic Functions


Find the probability of obtaining the same result if the experiment were conducted again
library(psych)

Usage
p.rep(p = 0.05, n = NULL, twotailed = FALSE)
p.rep.f(F, df2, twotailed = FALSE)
p.rep.r(r, n, twotailed = TRUE)
p.rep.t(t, df, df2 = NULL, twotailed = TRUE)

Arguments
p: conventional probability of statistic (e.g., of F, t, or r)
F: the F statistic
df: degrees of freedom of the t-test, or of the first group if unequal sizes
df2: degrees of freedom of the denominator of F or the second group in an unequal sizes t test
r: correlation coefficient
n: total sample size if using r
t: t-statistic if doing a t-test or testing significance of a regression slope
twotailed: should a one- or two-tailed test be used?
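For instance (the p and t values here are invented for illustration):

```r
library(psych)
# probability of replicating a conventional two-tailed p = .05 result
p.rep(p = .05, twotailed = TRUE)
# the same idea starting from a t-test result: t(30) = 2.5
p.rep.t(t = 2.5, df = 30)
```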


Critical Values (p = 1-α or 1-α/2 [1- or 2-tail], df = degrees of freedom, q = critical value)

Function: What it does
qnorm(p): Returns a value q such that the area to the left of q for a standard normal random variable is p.
pnorm(q): Returns a value p such that the area to the left of q on a standard normal is p.
qt(p,df): Returns a value q such that the area to the left of q on a t(df) distribution equals p.
pt(q,df): Returns p, the area to the left of q for a t(df) distribution.
qf(p,df1,df2): Returns a value q such that the area to the left of q on an F(df1, df2) distribution is p. For example, qf(.95,3,20) returns the 95% point of the F(3, 20) distribution.
pf(q,df1,df2): Returns p, the area to the left of q on an F(df1, df2) distribution.
qchisq(p,df): Returns a value q such that the area to the left of q on a χ²(df) distribution is p.
pchisq(q,df): Returns p, the area to the left of q on a χ²(df) distribution.
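A few spot checks against standard table values:

```r
qt(.975, df = 20)    # two-tailed .05 critical value of t(20): about 2.086
pt(2.086, df = 20)   # area to the left of that value: roughly .975
qf(.95, 3, 20)       # .05 critical value of F(3, 20)
qchisq(.95, df = 1)  # .05 critical value of chi-square(1): about 3.84
```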

The birthday paradox
Find the number of occurrences needed for a given probability of the event:
qbirthday(prob = 0.5, classes = 365, coincident = 2)
Find the probability of the event given the number of occurrences:
pbirthday(n, classes = 365, coincident = 2)

EXAMPLES
qbirthday(prob = .95, classes = 365, coincident = 2)
pbirthday(23, classes = 365, coincident = 2)


Function Writing Information

LOOPS

for()
Using a for loop to repeat a function over and over again. Example:
X <- function(col=10, rows=40){
    vec <- 1:col
    holder <- c()
    for (i in 1:rows){
        perm <- sample(vec, replace=F)
        holder <- rbind(holder, perm)
    }
    rownames(holder) <- paste("obs. ", 1:rows, sep="")
    colnames(holder) <- paste("VAR-", LETTERS[1:col], sep="")
    holder
}
X(15, 20)
#===========================================
#loop to repeat a function
#===========================================
DFer <- list()
n <- 10
j <- 6
for (i in 1:n){
    DFer[[i]] <- data.frame(A=1:j, B=rnorm(j), C=letters[1:j])
}
DFer
#===========================================
#or (the 2nd allocates the vector ahead of time)
#===========================================
getDFs <- function(n, j) {
    df <- vector("list", n)  # if you know the size, allocate the object beforehand
    for (i in seq(n)) df[[i]] <- data.frame(A = seq(j), B = rnorm(j), C = letters[seq(j)])
    return(df)
}  # end function
(x <- getDFs(10, 4))
#===========================================
#put it together (the list into a data frame)
#===========================================
do.call("rbind", x)
library(plyr)
ldply(x, rbind)

Infinite Repeat Loop


i <- 1
repeat{
    i <- i/2
    print(i)
    flush.console()
}

Repeat Loop
i <- 1
repeat{
    i <- i/2
    print(i)
    flush.console()
    if (i < .0005) break
}

For Loop with next


i <- 0
for (i in 1:100){
    if (i%%2==0) next
    i <- i + 1
    print(i)
    flush.console()
}

While Loop
i <- c(1)
while(i < 20){
    i <- c(i, i*1.5)
    print(i)
    flush.console()
}

For Loop with break and next


i <- 0
for (i in 1:100){
    if (i%%2==0) next
    if (i > 90) break
    i <- i + 1
    print(i)
    flush.console()
}

Nested For Loop


for (i in 1:2){
    for (j in 20:21){
        for (k in c("horse", "cow")){
            print(i)
            print(j)
            print(i*j)
            print(k)
        }
    }
}


IFELSE
ifelse(test, value returned if TRUE, value returned if FALSE)
Example


x <- sample(-2:5, 20, replace=T); x
outcome <- ifelse(x >= 0, sqrt(x), NA)
data.frame(x, outcome)

Switch Function Example 1


Central <- function(y, measure = "Mean"){
    switch(measure,
        Mean = mean(y),
        Geometric = exp(mean(log(y))),
        Harmonic = 1/mean(1/y),
        Median = median(y),
        stop("choose a mean")
    )
}
Central(mtcars$mpg, "Median")
Central(mtcars$mpg, "Geometric")
Central(mtcars$mpg, "Harmonic")

Example 2
FUN <- function(x){
    switch(x,
        `1` = "A",
        `2` = "B",
        `3` = "C",
        stop("choose a # between 1-3")
    )
}
FUN(1)
FUN(2)
FUN(4)


Repeat Loops (complex example)
EXAMPLE:


#FIRST I'LL RECREATE A DATA SET. IT'LL CONTAIN REDUNDANCY
DATA <- structure(list(person = structure(c(4L, 1L, 5L, 4L, 1L, 3L, 1L, 4L, 3L, 2L, 1L),
        .Label = c("greg", "researcher", "sally", "sam", "teacher"), class = "factor"),
    sex = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L),
        .Label = c("f", "m"), class = "factor"),
    adult = c(0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L),
    state = structure(c(2L, 7L, 9L, 11L, 5L, 4L, 8L, 3L, 10L, 1L, 6L),
        .Label = c("Shall we move on? Good then.", "Computer is fun. Not too fun.",
            "I distrust you.", "How can we be certain?", "I am telling the truth!",
            "Im hungry. Lets eat. You already?", "No its not, its ****.",
            "There is no way.", "What to do?", "What are you talking about?",
            "You liar, it stinks!"), class = "factor")),
    .Names = c("person", "sex", "adult", "state"), class = "data.frame",
    row.names = c(NA, -11L))
DATA <- data.frame(rbind(DATA, DATA, DATA))
DATA <- data.frame(rbind(DATA, DATA, DATA))
DATA <- data.frame(tot = 1:nrow(DATA), DATA)
DATA <- DATA[with(DATA, order(person, tot)), ]
rownames(DATA) <- 1:nrow(DATA)
#==========================================
#A SIMPLE WORD COUNT FUNCTION
word.count <- function(text, by = "row") {
    unblanker <- function(x) subset(x, nchar(x) > 0)
    word.split <- function(x) sapply(x, function(x) as.vector(unlist(strsplit(x, " "))))
    reducer <- function(x) gsub("\\s+", " ", x)
    txt <- sapply(as.character(text), function(x) ifelse(is.na(x), "", x))
    OP <- switch(by,
        all = length(unblanker(unlist(word.split(reducer(unlist(as.character(txt))))))),
        row = sapply(txt, function(x)
            length(unblanker(unlist(word.split(reducer(unlist(as.character(x)))))))))
    ifelse(OP == 0, NA, OP)
}
#==========================================
DATA$wc <- word.count(DATA$state)
#==========================================
#METHOD 1 - DASON
g <- function(x, k = 30){
    # Need to know how long the final vector should be
    n <- length(x)
    # Take care of case where we don't get any groups.
    if(sum(x) < k){
        ans <- rep(NA, n)
        return(factor(ans))
    }
    # Store where we want to break into a new group
    breaks <- c()
    repeat{
        # Find the first spot where the vector sums to at least k
        # If that doesn't happen we get Inf and a warning - suppress the warning
        spot <- suppressWarnings(min(which(cumsum(x) >= k)))
        # If we got Inf then the sum couldn't reach k
        if(!is.finite(spot)){
            break  # Jump out of the repeat loop
        }
        # Remove the spots that we accounted for
        x <- x[-c(1:spot)]
        # Note where the break is
        breaks <- c(breaks, spot)
    }
    ans <- rep(NA, n)
    groups <- paste("sel_", rep(1:(length(breaks)), breaks), sep = "")
    ans[1:length(groups)] <- groups
    return(factor(ans))
}
# Try it out
x <- subset(DATA, person == 'sam');     g(x$wc)
y <- subset(DATA, person == 'teacher'); g(y$wc)
z <- subset(DATA, person == 'greg');    g(z$wc)
#METHOD 2 - BRYANGOODRICH
f <- function(x) {
    # Initialize variables
    r <- length(x)                  # Total size
    n <- 1                          # Starting index
    i <- 1                          # Group index
    sum <- 0                        # Container for sum check
    groups <- vector("character", r)
    # Loop through r-length vector x
    for (m in seq(r)) {
        sum <- sum + x[m]           # Add to running sum
        isEnough <- sum >= 30
        # Block allocation and index adjustments
        if (isEnough) {
            groups[n:m] <- paste("sel_", i, sep = "")
            i <- i + 1              # Increment group index
            n <- m + 1              # Increment start index
            sum <- 0                # Start sum check over
        }                           # end if isEnough
    }                               # end for m
    groups[groups == ""] <- NA
    return(factor(groups))
}                                   # end function
# Try it out
x <- subset(DATA, person == 'sam');     f(x$wc)
y <- subset(DATA, person == 'teacher'); f(y$wc)
z <- subset(DATA, person == 'greg');    f(z$wc)


Menu (Interactive Mode)
menu(choices, graphics = FALSE, title = NULL)

Arguments
choices: a character vector of choices
graphics: a logical indicating whether a graphics menu should be used if available
title: a character string to be used as the title of the menu; NULL is also accepted
EXAMPLES
switch(menu(c("List letters", "List LETTERS", "What does this do?")) + 1,
    cat("Nothing done\n"), letters, LETTERS, c("Oh now I get it"))
#=====================================================================
.hurtz.donut <- function(){
    cat("You want a hurtz donut?\n\n")
    switch(menu(c("Yes", "No")),
        cat("<PUNCH>\nHurts, don't it?\n"),
        cat("What a wimp!!\n"))
}
.hurtz.donut()

Menu (gwidgets style) menu(choices, title, graphics = TRUE) select.list(choices, title)


menu(sort(.packages(all.available = TRUE)), title = "packages", graphics = TRUE)
#returns --> [1] 17
select.list(sort(.packages(all.available = TRUE)), title = "packages")
#returns --> [1] "car"


Progress Bars (base), see also: tcltk, plyr, RGtk2
txtProgressBar(min = 0, max = 1, initial = 0, char = "=", width = NA, title, label, style = 1, file = "")
winProgressBar(title = "R progress bar", label = "", min = 0, max = 1, initial = 0, width = 300)
close()  #needed after the call to the progress bar
#EXAMPLE PASSING SEQUENCE ALONG THE VECTOR
total <- nrow(mtcars)
progress.bar <- TRUE
type <- FALSE
#progress.bar <- FALSE   #parameter to play with
#type <- 'text'          #parameter to play with
if(progress.bar) {
    if (Sys.info()[['sysname']] == "Windows" & type != "text"){
        # create progress bar
        pb <- winProgressBar(title = "progress bar", min = 0, max = total, width = 300)
        lapply(1:total, function(i) {
            Sys.sleep(.5)
            setWinProgressBar(pb, i, title = paste(round(i/total*100, 0), "% done"))
        })
        close(pb)
    } else {
        # create progress bar
        pb <- txtProgressBar(min = 0, max = total, style = 3)
        lapply(1:total, function(i) {
            Sys.sleep(.5)
            setTxtProgressBar(pb, i)
        })
        close(pb)
    }
} else {
    Sys.sleep(total/4)
    return("should have used a progress bar")
}
#EXAMPLE PASSING THE VECTOR (with global assignment)
w <- c("raptors are awesome don't you all agree")
y <- unlist(strsplit(w, " "))
total <- length(y)
#WINDOWS VERSION
pb <- winProgressBar(title = "progress bar", min = 0, max = total, width = 300)
i <- 0
lapply(y, function(x){
    z <- nchar(x); Sys.sleep(.5)
    i <<- i + 1
    setWinProgressBar(pb, i, title = paste(round(i/total*100, 0), "% done"))
    return(z)
})
close(pb)
#STANDARD TEXT VERSION
pb <- txtProgressBar(min = 0, max = total, style = 3)
i <- 0
lapply(y, function(x){
    z <- nchar(x); Sys.sleep(.5)
    i <<- i + 1
    setTxtProgressBar(pb, i)
    return(z)
})
close(pb)
#EXAMPLE PASSING THE VECTOR (matching on position)
w <- c("raptors are awesome don't you all agree")
y <- unlist(strsplit(w, " "))
total <- length(y)
#WINDOWS TEXT BAR
pb <- winProgressBar(title = "progress bar", min = 0, max = total, width = 300)
lapply(y, function(x){
    z <- nchar(x); Sys.sleep(.5)
    i <- which(y %in% x)
    setWinProgressBar(pb, i, title = paste(round(i/total*100, 0), "% done"))
    return(z)
})
close(pb)
#STANDARD TEXT BAR
pb <- txtProgressBar(min = 0, max = total, style = 3)
lapply(y, function(x){
    z <- nchar(x); Sys.sleep(.5)
    i <- which(y %in% x)
    setTxtProgressBar(pb, i)
    return(z)
})
close(pb)



Pass a data frame to a function (One method)


f <- function(x, data = NULL, fun) {
    fun(eval(match.call()$x, data))
}
f(hp, mtcars, mean)

Passing a Variable (Vector name) on to an argument as a character
Best seen with the examples below; four is passed on without using quotes.
SIMPLE EXAMPLE
extract.arg <- function(a){
    s <- substitute(a)
    as.character(s)
}
extract.arg(hello)

EXAMPLE WITH DATA
mtcars2 <- mtcars
library(doBy)
mtcars2$cyl <- with(mtcars2, recodeVar(mtcars2$cyl, src = c(4, 6, 8),
    tgt = c("four", "six", "eight"), default = NULL, keep.na = TRUE))
with(mtcars2, cyl)
Tfun <- function(DV, IV, group1){
    g <- substitute(group1)
    g1 <- DV[IV == as.character(g)]
    p <- mean(g1)
    list(g1, p)
}
with(mtcars2, Tfun(mpg, cyl, four))

Passing a character string to a function: eval + parse
Best seen with the example below.
#EXAMPLE
x <- c(1:20)
myoptions <- "trim=0, na.rm=FALSE"
eval(parse(text = paste("mean(x,", myoptions, ")")))
library(fortunes); fortune(106)


Functions That Take Input
scan(n = , what = double(0), quiet = T)


EXAMPLE ASKING FOR 1 INPUT
x <- function(){
    cat("\n", "Enter Value", "\n")
    x <- scan(n = 1, what = double(0), quiet = T)
    x
}

EXAMPLE ASKING FOR 4 INPUTS
x <- function(){
    cat("\n", "Enter Values", "\n")
    x <- scan(n = 4, what = double(0), quiet = T)
    x
}


Warning Messages
warning(..., call. = TRUE, immediate. = FALSE, domain = NULL)
suppressWarnings()

EXAMPLE
test <- function() warning("You idiot you forgot quotes!")
test()   ## shows call
test2 <- function() warning("You idiot you forgot quotes!", call. = FALSE)
test2()  ## no call

Alarm alarm() OR cat("\a")

#makes a call to "\a"

Tell a function what to do with a missing argument missing(x) #Best understood with an example

EXAMPLE
myplot <- function(x, y) {
    if(missing(y)) {
        y <- x
        x <- 1:length(y)
    }
    plot(x, y)
}

Used within the function to tell it what to do if y is not given. Technically this could be done with myplot <- function(x, y = x) as well.

Reset Parameters
on.exit()
Useful for resetting graphical parameters or performing cleanup actions.

textClick <- function(express, col = "black", cex = NULL, srt = 0, family = "sans", ...){
    old.par <- par(no.readonly = TRUE)
    on.exit(par(old.par))
    par(mar = rep(0, 4), xpd = NA)
    x <- locator(1)
    X <- format(x, digits = 3)
    text(x[1], x[2], express, col = col, cex = cex, srt = srt, family = family, ...)
    noquote(paste(X[1], X[2], sep = ", "))
}


Sequence for the n of a vector
Traditionally people use: 1:length(x)
However this misbehaves when x has length 0. Use instead: seq_along(x)
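The failure case is a zero-length vector, where 1:length(x) silently counts down:

```r
x <- numeric(0)
1:length(x)   # 1 0  -- a for loop over this runs twice on an empty vector
seq_along(x)  # integer(0) -- a loop over this runs zero times, as intended
```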


Viewing the code of functions
If a function is non-generic, or is one you've created (or downloaded), you can view its code by simply typing the name of the function without parentheses. For the aov() function type: aov (and enter)

Viewing the code of generic functions
Type methods(function). This gives a list of the function's methods (with suffixes). Now type the function with the suffix name for its code.
methods(anova)
anova.glm
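Not every method listed this way is visible by typing its name; for non-exported methods, getS3method() (or the ::: operator) retrieves the code. This is an addition beyond the original notes:

```r
methods(anova)               # lists anova.glm, anova.lm, etc.
getS3method("anova", "glm")  # shows the method's code even when it is not exported
stats:::anova.glm            # equivalent for this particular method
```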

How R evaluates true and false
In R, TRUE is treated as the number 1 and FALSE as the number 0. This can be very useful in practice.
Example: T+T+F = 2; T*T*F = 0
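The same trick counts how many elements of a vector meet a condition:

```r
x <- c(2, 9, 3, 8, 5)
sum(x > 4)   # 3   -- number of elements greater than 4
mean(x > 4)  # 0.6 -- proportion of elements greater than 4
```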


Determine how much memory an object takes up
object.size(x, units = )
Units can be changed: units = c("b", "auto", "Kb", "Mb", "Gb")
EXAMPLES
print(object.size(library(base)), units = "auto")  #specific library
print(object.size(library), units = "auto")        #entire library
print(object.size(Tpass), units = "auto")          #a function
print(object.size(ls()), units = "auto")           #current objects in workspace

Determine Memory Allocation and Increase Allocation
memory.limit()           #Report memory limit
memory.limit(size=3500)  #increase memory limit

Reduce Objects and Junk in Memory
gc()
rm(list = ls())
rm(list = ls(all.names = TRUE))


Timing

Determine How Long It Takes to Run a Function (method 1)
library(microbenchmark) [very accurate]
microbenchmark(..., list, times=100, control=list())
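A small sketch (the two expressions compared are my own illustration, not from the notes):

```r
library(microbenchmark)
x <- rnorm(1e4)
# compare two equivalent ways of computing a mean; times = evaluations each
microbenchmark(mean(x), sum(x)/length(x), times = 100)
```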


Determine How Long It Takes to Run a Function (method 2)
library(rbenchmark)

benchmark(...,
    columns = c("test", "replications", "elapsed", "relative",
        "user.self", "sys.self", "user.child", "sys.child"),
    order = "test", replications = 100, environment = parent.frame())

Arguments
...: captures any number of unevaluated expressions passed to benchmark as named or unnamed arguments (the functions to be tested)
columns: a character or integer vector specifying which columns should be included in the returned data frame (see below)
order: a character or integer vector specifying which columns should be used to sort the output data frame. Any of the columns that can be specified for columns (see above) can be used, even if it is not included in columns and will not appear in the output data frame. If order=NULL, the benchmarks will appear sorted by the order of the expressions in the call to benchmark
replications: a numeric vector specifying how many times an expression should be evaluated when the runtime is measured. If replications consists of more than one value, each expression will be benchmarked multiple times, once for each value in replications
environment: the environment in which the expressions will be evaluated
#example and output
benchmark(
    Plyr = ddply(mtcars, .(cyl, gear), summarise, output = mean(hp)),
    Tapply = with(mtcars, data.frame(output = tapply(hp, interaction(cyl, gear), mean))),
    Aggregate = aggregate(hp ~ cyl + gear, mtcars, mean),
    order = c('replications', 'elapsed'))
#        test replications elapsed relative user.self sys.self user.child sys.child
# 2    Tapply          100    0.19 1.000000      0.18        0         NA        NA
# 3 Aggregate          100    0.51 2.684211      0.39        0         NA        NA
# 1      Plyr          100    1.36 7.157895      1.10        0         NA        NA

Explanation: Typically just look at the elapsed and relative times. The relative column is the most interesting: it tells you how long each expression takes compared with the fastest one. In the example above, Tapply is quickest, Aggregate takes 2.68 times longer, and the Plyr solution takes 7.15 times longer than Tapply.

Determine How Long It Takes to Run a Function (method 3)
system.time()


Timer 1: library(data.table)
begin.time <- Sys.time()
timetaken(begin.time)

Timer 2: library(matlab)
tic(gcFirst=FALSE)
toc(echo=TRUE)

Timer 3: base
x <- Sys.time()
difftime(Sys.time(), x)

Time Stamping timestamp()


Generate Reproducible Code
Write a code/data frame to a file to send to a help archive (reproducible code)
dput(object, "file to write to")
EXAMPLE
dput(mtcars, "foo.txt")

SEE ALSO: the "Exporting an output to a file" section using cat()

Generate a window of a code/data frame to send to a help archive (reproducible code)
page(object)
EXAMPLE
page(mtcars)


Customized Workflow

Create Multiple Working Directories
Create a shortcut where you want a new directory. Locate (& copy [ctrl + c]) the location of where you stored the shortcut (i.e., C:\Users\Rinker\Desktop\PhD Program\CEP 523-Stat Meth Ed Inference\R Stuff). Click on Properties and paste (ctrl + v) the location into the Start In box. Now data files from this location load automatically without referencing their specific location.

Stop the Stupid Start Up Message and Auto Save
At the end of the Target box (see "Create Multiple Working Directories") location add a space and then -q --no-save
"C:\Program Files\R\R-2.13.0\bin\i386\Rgui.exe" -q --no-save


.First Function (start up commands) Open a new script from within [R]. Create a .First function with the following type of set up:
.First <- function(Sys.time){
    library(psych)
    library(car)
    options(repos = "http://lib.stat.cmu.edu/R/CRAN")  #See Choosing a CRAN Mirror
    source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Missing Values/.NA Bundle.txt")
    source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Assumption Testing/Tests of Normality.txt")
    cat("Hello Tyler! Today is", date(), "\n")
}

Now save it to the working directory (see "Creating A Working Directory") as .Rprofile
NOTE: As of right now I have not been able to edit this file. I have to delete it and resave it using the method above to edit it.

Choosing a CRAN Mirror
Type the following to choose a CRAN mirror and to find its URL:
chooseCRANmirror()
options("repos")[[1]][1]  #gives the URL for the mirror you just chose

Now paste the following command to the .First function in your .Rprofile
options(repos="URL")

See options .Options


Paths, Directories and System Info

Check the System Info and Operating System
Sys.info()
Sys.info()[['sysname']]

Check System Path Info
Sys.getenv("USERNAME")
Sys.getenv("HOME")
Sys.getenv()

> Sys.getenv("USERNAME")
[1] "Rinker"
> Sys.getenv("HOME")
[1] "C:\\Users\\Rinker"

Determine Working Directory / Set Working Directory
getwd()
setwd(dir)

Directory Functions
dir()                                  #list files in working directory
list.files()                           #list files in working directory
file.info(list.files())                #get info on files in directory
path.expand("~/Desktop/PhD Program/")  #replaces the tilde with user home directory


File Editing and Web Browsing

Opening Web Pages, method 1
See the web() function in scripts.
browseURL(url, browser = getOption("browser"), encodeIfNeeded = FALSE)

Opening Web Pages, method 2
shell.exec(url)
Arguments: file = file or url to open

Not recommended as the function may not work on non-Windows machines.

Opening Files Within R
See the ret() function in scripts.
shell.exec(file)
Arguments: file = file or url to open

Open text files for editing in R
file.edit("file")
Example file.edit(".Rprofile")


Determine if a file exists
file.exists(path)
If you don't provide a path, R only checks the working directory.

Rename a file
file.rename(from, to)

Delete a directory
unlink(x, recursive = TRUE, force = FALSE)

Format R code
library(formatR)

sink(file = "New.doc")
4*3
sink()
file.exists("New.doc")
Open()  #user defined; look at the New.doc
file.rename("New.doc", "Renamed.doc")
file.exists("New.doc")
Open()  #look at the Renamed.doc
delete(Renamed.doc)  #user defined
Open()  #note the Renamed.doc is gone

tidy.source(source, file.output="windows console")


#save it somewhere or copy and then read it in
#check tidy.source's clipboard option
library(formatR)
xx <- pathPrep()  #user defined; returns a path, e.g. C:\Users\Rinker\Desktop\transcript Functions.R
shell.exec(xx)
tidy.source(source = xx, file = "transcript Functions.txt")
tidy.source(source = "clipboard")  #using a clipboard
tidy.source(source = "clipboard", file = "transcript Functions.txt")  #using a clipboard


Debugging

Find out the values of a function up to a given point
browser()


testfun <- function(x = 5){
    y = 5
    browser()
    print(x + y)
}
testfun()
Browse[1]> y; x; p  #check the values of each of these objects within the function
[1] 5
[1] 5
Error: object 'p' not found

Exit Browser
Type Q and enter

Debug
Use debug(Function_Name) and then call the function to step through it line by line
debug(mean)
mean(1:10)
undebug(mean)

try and tryCatch


L <- list(a = c(1, 3, 5), b = c("a", "v"), d = mtcars[, 1])
lapply(L, function(x){
    try(sum(x))
})
sapply(L, function(x){
    tryCatch(sum(x), error = function(err) NA)
})


Expand a column that's a list column


#====================================
# THE DATA FRAME
#====================================
input <- data.frame(site = 1:6,
    sector = factor(c("north", "south", "east", "west", "east", "south")),
    observations = I(list(c(1, 2, 3), c(4, 3), c(), c(14, 12, 53, 2, 4), c(3), c(23))))
#====================================
# EXPAND THE COLUMN AND MERGE
#====================================
obs.l <- sapply(input$observations, length)
desire.output <- data.frame(site = rep(1:6, obs.l), obs = unlist(input$observations))
merge(input[, -3], desire.output, all.x = TRUE)
#NOTE: THE SITE IS THE KEY FOR THEN MERGING WITH THE DATA FRAME


Expand a Text Column (Split by sentence)


sentSplit <- function(dataframe, text.var, splitpoint = NULL, rownames = numeric,
    text.place = original) {
    DF <- dataframe
    input <- as.character(substitute(text.var))
    re <- ifelse(is.null(splitpoint), "[\\?\\.\\!]", as.character(substitute(splitpoint)))
    RN <- as.character(substitute(rownames))
    TP <- as.character(substitute(text.place))

    breakinput <- function(input, re) {
        j <- gregexpr(re, input)
        lengths <- unlist(lapply(j, length))
        spots <- lapply(j, as.numeric)
        first <- unlist(lapply(spots, function(x) {
            c(1, (x + 1)[-length(x)])
        }))
        last <- unlist(spots)
        ans <- substring(rep(input, lengths), first, last)
        return(list(text = ans, lengths = lengths))
    }
    j <- breakinput(DF[, input], re)
    others <- DF[, -which(colnames(DF) == input)]
    idx <- rep(1:dim(others)[1], j$lengths)
    ans <- cbind(input = j$text, others[idx, ])
    colnames(ans)[1] <- input
    if (RN == "numeric") {
        rownames(ans) <- 1:nrow(ans)
    }
    if (TP == "original") {
        ans <- ans[, c(colnames(DF))]
    } else {
        if (TP == "right") {
            ans <- data.frame(ans[, -1], ans[, 1])
            colnames(ans) <- c(colnames(ans)[-ncol(ans)], input)
        } else {
            if (TP == "left") {
                ans
            }
        }
    }
    return(ans)
}
#=====================
#TEST IT
#=====================
DATA <- structure(list(person = structure(c(4L, 1L, 5L, 4L, 1L, 3L, 1L, 4L, 3L, 2L, 1L),
    .Label = c("greg", "researcher", "sally", "sam", "teacher"), class = "factor"),
    sex = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L),
        .Label = c("f", "m"), class = "factor"),
    adult = c(0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L),
    state = structure(c(2L, 7L, 9L, 11L, 5L, 4L, 8L, 3L, 10L, 1L, 6L),
        .Label = c("Shall we move on? Good then.", "Computer is fun. Not too fun.",
        "I distrust you.", "How can we be certain?", "I am telling the truth!",
        "Im hungry. Lets eat. You already?", "No its not, its dumb.",
        "There is no way.", "What should we do?", "What are you talking about?",
        "You liar, it stinks!"), class = "factor"),
    code = structure(c(1L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 2L, 3L),
        .Label = c("K1", "K10", "K11", "K2", "K3", "K4", "K5", "K6", "K7", "K8", "K9"),
        class = "factor")),
    .Names = c("person", "sex", "adult", "state", "code"),
    row.names = c(NA, -11L), class = "data.frame")
sentSplit(DATA, state)
sentSplit(DATA, state, rownames = sub)
sentSplit(DATA, state, text.place = right)
sentSplit(DATA, state, text.place = left)


Collapse A Text Column by A grouping Variable


   group                              text
1      m     Computer is fun. Not too fun.
2      m             No its not, its dumb.
3      m                What should we do?
4      m              You liar, it stinks!
5      m           I am telling the truth!
6      f            How can we be certain?
7      m                  There is no way.
8      m                   I distrust you.
9      f       What are you talking about?
10     f      Shall we move on? Good then.
11     m Im hungry. Lets eat. You already?

  group  text
1     m  Computer is fun. Not too fun. No its not, its dumb. What should we do? You liar, it stinks! I am telling the truth!
2     f  How can we be certain?
3     m  There is no way. I distrust you.
4     f  What are you talking about? Shall we move on? Good then.
5     m  Im hungry. Lets eat. You already?

#THE DATA
dat <- structure(list(group = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L),
    .Label = c("f", "m"), class = "factor"),
    text = structure(c(2L, 7L, 9L, 11L, 5L, 4L, 8L, 3L, 10L, 1L, 6L),
        .Label = c("Shall we move on? Good then.", "Computer is fun. Not too fun.",
        "I distrust you.", "How can we be certain?", "I am telling the truth!",
        "Im hungry. Lets eat. You already?", "No its not, its dumb.",
        "There is no way.", "What should we do?", "What are you talking about?",
        "You liar, it stinks!"), class = "factor")),
    .Names = c("group", "text"), class = "data.frame", row.names = c(NA, -11L))

#METHOD 1 (A better choice)
# Needed for later
k <- rle(as.numeric(dat$group))
# Create a grouping vector
id <- rep(seq_along(k$len), k$len)
# Combine the text in the desired manner
out <- tapply(dat$text, id, paste, collapse = " ")
# Bring it together into a data frame
data.frame(text = out, group = levels(dat$group)[k$val])

#METHOD 2
y <- rle(as.character(dat$group))
x <- y[[1]]
dat$new <- as.factor(rep(1:length(x), x))
text <- aggregate(text ~ new, dat, paste, collapse = " ")[, 2]
data.frame(text, group = y[[2]])

#METHOD 3 (combined)
k <- rle(as.numeric(dat$group)); dat$id <- rep(seq_along(k$len), k$len)
data.frame(sex = rle(as.character(dat$group))$val,
    aggregate(text ~ id, dat, paste, collapse = " "))


Enter an unknown number of vectors as unnamed arguments; return their names and math output
#FAKE DATA
a <- sample(50:90 + 0, 20, replace = TRUE)
b <- sample(50:90 + 20, 20, replace = TRUE)
d <- sample(50:90 - 20, 20, replace = TRUE)
#FUNCTION THAT WORKS IF SUPPLYING A LIST AS THE UNNAMED ARGUMENT
foo <- function(...){
    # Get the names of the objects that were passed into the function
    x <- as.character(match.call())[-1]
    # Apply mean to every object passed in
    y <- sapply(list(...), mean)
    return(list(x, y))
}
#TEST IT OUT
foo(a, b, d)
#THE OUTPUT
[[1]]
[1] "a" "b" "d"

[[2]]
[1] 72.90 92.80 51.75

Extract components from dots (...)


f1 <- function(x, ...) substitute(...())                   #Dunlap's method
f2 <- function(x, ...) match.call(expand.dots=FALSE)$...   #traditional match.call
f1(1, warning("Hmm"), stop("Oops"), cat("some output\n"))
f2(1, warning("Hmm"), stop("Oops"), cat("some output\n"))

Creating a Quasi-Package
#EXAMPLE
#Create objects such as data sets and functions to include in the file
.hurtz.donut <- function(){"You want a hurtz donut? Yes! <Punch> Hurts don't it?"}
.hurtz.donut()
(.FUNdat <- data.frame(cbind(LETTERS, 1:26)))
#Save the objects to the .RData file
save(.hurtz.donut, .FUNdat, file = "myFUNCTIONS.RData")
#just to show everything has been wiped clean (this will delete all objects from your workspace)
rm(list = ls(all.names = TRUE))
#Close out and reload [R]
load("myFUNCTIONS.RData")
.hurtz.donut()
.FUNdat


Function returns
return()      Specifically tells the function what to return. If return() is not given, the last line of the code will be returned.
print()       Forces the value to be printed at the console when the function is called.
invisible()   A feature for being able to recall a function-created object; the object is returned but not automatically printed.
EXAMPLE Invisible

test <- function(){
    with(mtcars, plot(mpg ~ hp))
    invisible(list("type1" = "Shh! I'm invisible.", "type2" = "Real quiet now."))
}
x <- test()
x$type1
x$type2

EXAMPLE2 Invisible

a <- data.frame(x = 1:10, y = 1:10)
test <- function(z){
    mean.x <- mean(z$x)
    nm <- as.character(substitute(z))
    print(mtcars)
    invisible(list(mean.x, nm))
}
x <- test(a)
x
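A minimal sketch of the return() vs invisible() difference on its own (f_return and f_invisible are made-up names): both hand the value back, but only the first auto-prints at the top level.

```r
f_return    <- function() return(42)     # auto-prints when called at top level
f_invisible <- function() invisible(42)  # returns 42 but prints nothing

f_return()          #[1] 42
f_invisible()       #(no output)
x <- f_invisible()
x                   #[1] 42 -- the value was returned, just not printed
```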

Function returns extended (return some, recall the rest later). Look at both examples.
#==========================================
#The original function that returns a list
#==========================================
test <- function(number = 10){
    XX <- number
    YY <- "hello"
    ZZ <- Sys.time()
    o <- list(x = XX, y = YY, z = ZZ)
    class(o) <- "stuff"
    return(o)
}
#=================================================
#This makes the above return one piece of the list
#=================================================
print.stuff <- function(stuff){
    print(stuff$z)
}
#=================================================
#See the end results
#=================================================
(PP <- test())   #returns what was specified by print.stuff
PP$y
PP$x             #recall the other components of the list

test <- function(number = 10){
    XX <- number
    YY <- "hello"
    ZZ <- Sys.time()
    o <- list(x = XX, y = YY, z = ZZ, zz = "Recall Me")
    class(o) <- "stuff"
    return(o)
}
#=================================================
#This makes the above return one piece of the list
#=================================================
print.stuff <- function(stuff){
    list(print(stuff$z), print(stuff$x))
}
#=================================================
#See the end results
#=================================================
(PP <- test())   #returns what was specified by print.stuff
PP$y
PP$zz
PP$z
PP$x             #recall the other components of the list


Rolling Math Functions (rolling mean, rolling median, ...)

sapply(seq(x), function(i) MATH.FUNCTION(x[seq(i)]))
x <- mtcars$disp
sapply(seq(x), function(i) median(x[seq(i)]))
sapply(seq(x), function(i) mean(x[seq(i)]))
sapply(seq(x), function(i) range(x[seq(i)]))
sapply(seq(x), function(i) sd(x[seq(i)]))
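These "rolling" statistics are expanding-window (from the start up to position i), so the rolling mean can be checked against a vectorized equivalent:

```r
x <- mtcars$disp
roll.mean <- sapply(seq(x), function(i) mean(x[seq(i)]))
# the same thing without the loop: running sum over running count
all.equal(roll.mean, cumsum(x) / seq_along(x))  #[1] TRUE
```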


Apply Family, PLYR & RESHAPE

PLYR splyr


#Group by subgroups, find max of another variable by these subgroups, return those rows
#################################################
## A FAKE DATA SET LIKE THE ONE YOU DESCRIBE ##
#################################################
DF <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
    .Label = c("1", "2"), class = "factor"),
    var2 = structure(c(1L, 4L, 5L, 9L, 3L, 8L, 2L, 6L, 10L, 7L),
        .Label = c("B", "C", "F", "G", "H", "I", "M", "P", "W", "Z"), class = "factor"),
    var2.1 = c(-0.184379525153166, -1.42413441621445, -0.245741502747687,
        0.762805889444348, -0.85561728601498, -0.358079724034542, -0.137767483903655,
        -0.952739149867607, 1.01227935773242, 0.0722005649132995),
    DF.DATE = structure(c(14662, 15556, NA, 14903, 15641, NA, 14970, 15625, 15075,
        14819), class = "Date")),
    .Names = c("ID", "var2", "var3", "DATE"), row.names = c(NA, -10L),
    class = "data.frame")
DF            #view dataframe
library(plyr) #get plyr
ddply(na.omit(DF), .(ID), summarise, max = max(DATE))
#or
ddply(na.omit(DF), "ID", summarise, max = max(DATE))
ddply(na.omit(DF), "ID", summarise, mean = mean(var3))

   ID var2      var3       DATE
1   1    B -0.184380 2010-02-22
2   1    G -1.424134 2012-08-04
3   1    H -0.245742       <NA>
4   1    W  0.762806 2010-10-21
5   1    F -0.855617 2012-10-28
6   2    P -0.358080       <NA>
7   2    C -0.137767 2010-12-27
8   2    I -0.952739 2012-10-12
9   2    Z  1.012279 2011-04-11
10  2    M  0.072201 2010-07-29

> ddply(na.omit(DF), .(ID), summarise, max = max(DATE))
  ID        max
1  1 2012-10-28
2  2 2012-10-12
> ddply(na.omit(DF), "ID", summarise, mean = mean(var3))
  ID       mean
1  1 -0.4253313
2  2 -0.0015067

X1 <- sample(1:20, 60, replace=TRUE)
X2 <- X1*(1+sample(seq(.00, .2, .005), 60, replace=TRUE))
X3 <- as.factor(sort(sample(c("dog", "cat", "pig", "snake"), 60, replace=TRUE)))
DF <- data.frame(X1, X2, X3)

> ddply(na.omit(DF), .(X3), summarise, cor = cor(X1,X2))
     X3       cor
1   cat 0.9970943
2   dog 0.9974141
3   pig 0.9959173
4 snake 0.9865586

library(plyr)
ddply(na.omit(DF), .(X3), summarise, cor = cor(X1, X2))  #correlation by group
stats <- function(x) c("mean" = mean(x), "med" = median(x), "sd" = sd(x),
    "var" = var(x), "n" = length(x))
ddply(na.omit(DF), .(X3), summarise, X1 = stats(X1), X2 = stats(X2))

> ddply(mtcars, .(cyl, am), with, each(min, mean, sd, max)(hp))
  cyl am min      mean       sd max
1   4  0  62  84.66667 19.65536  97
2   4  1  52  81.87500 22.65542 113
3   6  0 105 115.25000  9.17878 123
4   6  1 110 131.66667 37.52777 175
5   8  0 150 194.16667 33.35984 245
6   8  1 264 299.50000 50.20458 335

require(plyr)
ddply(mtcars, .(cyl, am), with, each(min, mean, sd, max)(hp))


DF <- structure(list(car_id = c(500L, 500L, 500L, 500L, 500L, 500L, 501L, 501L, 501L,
    501L, 501L, 501L, 501L, 502L, 502L, 502L, 502L, 502L, 502L),
    visitnum = c(40L, 50L, 60L, 100L, 110L, 120L, 40L, 50L, 60L, 100L, 110L, 120L,
        150L, 40L, 50L, 60L, 100L, 110L, 120L),
    measurement = c(2301L, NA, NA, NA, NA, NA, 4480L, NA, NA, NA, NA, NA, 38570L,
        NA, NA, NA, NA, NA, 2560L)),
    .Names = c("car_id", "visitnum", "measurement"), class = "data.frame",
    row.names = c(NA, -19L))
DF
library(plyr)
DF$measurement2 <- DF$measurement                      #duplicate measurement column
DF$measurement2[is.na(DF$measurement2)] <- 0           #replace NA's with 0
FM <- function(x) ifelse(sum(x) - x[1] > x[1], 1, 0)   #code to make a new column of 0 and 1
ddply(DF, .(car_id), transform, "flagmeasure" = FM(measurement2))[, -4]
ddply(DF, .(car_id), summarise, "flagmeasure" = FM(measurement2))

> DF
   car_id visitnum measurement
1     500       40        2301
2     500       50          NA
3     500       60          NA
4     500      100          NA
5     500      110          NA
6     500      120          NA
7     501       40        4480
8     501       50          NA
9     501       60          NA
10    501      100          NA
11    501      110          NA
12    501      120          NA
13    501      150       38570
14    502       40          NA
15    502       50          NA
16    502       60          NA
17    502      100          NA
18    502      110          NA
19    502      120        2560

> ddply(DF, .(car_id), transform, "flagmeasure" = FM(measurement2))[, -4]
   car_id visitnum measurement flagmeasure
1     500       40        2301           0
2     500       50          NA           0
3     500       60          NA           0
4     500      100          NA           0
5     500      110          NA           0
6     500      120          NA           0
7     501       40        4480           1
8     501       50          NA           1
9     501       60          NA           1
10    501      100          NA           1
11    501      110          NA           1
12    501      120          NA           1
13    501      150       38570           1
14    502       40          NA           1
15    502       50          NA           1
16    502       60          NA           1
17    502      100          NA           1
18    502      110          NA           1
19    502      120        2560           1

> ddply(DF, .(car_id), summarise, "flagmeasure" = FM(measurement2))
  car_id flagmeasure
1    500           0
2    501           1
3    502           1


#========================
# The data
#========================
test <- data.frame(group = c(rep(1, 4), rep(2, 5), 3), day = c(0:3, 0:4, 0),
    measure = c(5, 3, 7, 8, 3, 2, 4, 5, 7, 5))
(test1 <- test)
#========================
# With base (faster)
#========================
test$diff <- unlist(by(test$measure, test$group, function(x){x - x[1]}))
test$perchange <- unlist(by(test$measure, test$group, function(x){(x - x[1])/x[1]}))
test
#========================
# With plyr (slower)
#========================
test <- test1   #reset test
library(plyr)
perch <- function(x){(x - x[1])/x[1]}
differ <- function(x){x - x[1]}
ddply(test, .(group), transform, diff = differ(measure))
ddply(test, .(group), transform, perchange = perch(measure))
ddply(test, .(group), transform, diff = differ(measure), perchange = perch(measure))

   group day measure
1      1   0       5
2      1   1       3
3      1   2       7
4      1   3       8
5      2   0       3
6      2   1       2
7      2   2       4
8      2   3       5
9      2   4       7
10     3   0       5

   group day measure diff  perchange
1      1   0       5    0  0.0000000
2      1   1       3   -2 -0.4000000
3      1   2       7    2  0.4000000
4      1   3       8    3  0.6000000
5      2   0       3    0  0.0000000
6      2   1       2   -1 -0.3333333
7      2   2       4    1  0.3333333
8      2   3       5    2  0.6666667
9      2   4       7    4  1.3333333
10     3   0       5    0  0.0000000


test <- data.frame(person = c("A", "A", "A", "A", "B", "B", "C", "C"),
    day = c(7, 14, 21, 22, 7, 14, 7, 14),
    measure = c(112, 0, 500, 600, 0, 0, 0, 50),
    temp = c(36.9, 36.1, 37.2, 39.6, 35, 37, 37, 35))
test$detector <- ifelse(test$measure > 0 & test$temp >= 37, 'TYPE.II',
    ifelse(test$measure > 0 & test$temp < 37, 'TYPE.I', 'ok'))
firstFUN <- function(x, y) y[which(x != 'ok')[1]]
typeFUN <- function(x, y) y[which(x != 'ok')[1]]
(outcome <- ddply(test, .(person), transform,
    "failure.day" = firstFUN(detector, day),
    "failure.type" = typeFUN(detector, detector)))

> test
  person day measure temp
1      A   7     112 36.9
2      A  14       0 36.1
3      A  21     500 37.2
4      A  22     600 39.6
5      B   7       0 35.0
6      B  14       0 37.0
7      C   7       0 37.0
8      C  14      50 35.0

> outcome
  person day measure temp detector failure.day failure.type
1      A   7     112 36.9   TYPE.I           7       TYPE.I
2      A  14       0 36.1       ok           7       TYPE.I
3      A  21     500 37.2  TYPE.II           7       TYPE.I
4      A  22     600 39.6  TYPE.II           7       TYPE.I
5      B   7       0 35.0       ok          NA         <NA>
6      B  14       0 37.0       ok          NA         <NA>
7      C   7       0 37.0       ok          14       TYPE.I
8      C  14      50 35.0   TYPE.I          14       TYPE.I


Find the last occurrence of a value


Test <- dput(structure(list(Person = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L,
    3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L), .Label = c("A", "B", "C", "D"),
    class = "factor"),
    Day = c(1L, 5L, 12L, 1L, 3L, 5L, 9L, 27L, 1L, 3L, 5L, 9L, 19L, 1L, 2L, 4L, 6L, 23L),
    Parasites = c(100L, 0L, 0L, 34L, 15L, 11L, 0L, 0L, 188L, 15L, 0L, 8L, 0L, 35L,
        0L, 0L, 12L, 10L)),
    .Names = c("Person", "Day", "Parasites"), class = "data.frame",
    row.names = c(NA, -18L)))
#############################################################################
#METHOD 1
TESTER <- function(day, parasites){
    x <- rle(parasites)
    ifelse(x[[2]][length(x[[2]])] == 0,
        as.character(day[length(parasites) + 1 - x[[1]][length(x[[1]])]]),
        "DNC")
}
NEW <- ddply(Test, .(Person), transform, "clearance.day" = TESTER(Day, Parasites))
############################################################################
#METHOD 2
fun <- function(Parasite, Day){
    tmp <- rle(rev(Parasite))
    len <- length(Parasite)
    if(tmp$values[1] != 0){
        return(rep("DNC", len))
    }
    n <- len
    k <- n + 1 - tmp$lengths[1]
    return(rep(Day[k], len))
}
ddply(Test, .(Person), summarize, Day = Day, clearance = fun(Parasites, Day))
#################################################################################
#    test replications elapsed relative user.self sys.self user.child sys.child #
# 1 meth1         1000    7.92 1.000000      7.05     0.00         NA        NA #
# 2 meth2         1000   15.70 1.982323     10.59     0.01         NA        NA #
#################################################################################

> Test
   Person Day Parasites
1       A   1       100
2       A   5         0
3       A  12         0
4       B   1        34
5       B   3        15
6       B   5        11
7       B   9         0
8       B  27         0
9       C   1       188
10      C   3        15
11      C   5         0
12      C   9         8
13      C  19         0
14      D   1        35
15      D   2         0
16      D   4         0
17      D   6        12
18      D  23        10

> NEW
   Person Day Parasites clearance.day
1       A   1       100             5
2       A   5         0             5
3       A  12         0             5
4       B   1        34             9
5       B   3        15             9
6       B   5        11             9
7       B   9         0             9
8       B  27         0             9
9       C   1       188            19
10      C   3        15            19
11      C   5         0            19
12      C   9         8            19
13      C  19         0            19
14      D   1        35           DNC
15      D   2         0           DNC
16      D   4         0           DNC
17      D   6        12           DNC
18      D  23        10           DNC

APPLY A FUNCTION TO A DATA SET BROKEN DOWN BY A CATEGORICAL VARIABLE


distTab(mtcars, 5)   #Normal use of the function
require(plyr)
dlply(mtcars, .(cyl), function(x) distTab(x, 4))
dlply(mtcars, .(cyl, am), function(x) distTab(x, 4))
dlply(CO2, .(Type, Treatment), function(x) distTab(x, 4))
dlply(CO2, .(Type, Treatment), mean)


APPLY A FUNCTION BY GROUP TO TWO COLUMNS OF A DATA FRAME
use: lapply with split (faster) OR by
df <- data.frame(group = rep(c("G1", "G2"), each = 10), var1 = rnorm(20),
    var2 = rnorm(20))
r <- by(df, df$group, FUN = function(X) cor(X$var1, X$var2, method = "spearman"))
j <- lapply(split(df, df$group), function(x){cor(x[,2], x[,3], method = "spearman")})
data.frame(group = names(j), corr = unlist(j), row.names = NULL)

#INPUT
   group        var1        var2
1     G1 -0.60324036  0.22355138
2     G1 -1.64211667 -0.78414595
3     G1 -0.26629745  1.00448792
4     G1  0.42810545  1.04770451
5     G1 -1.26773098 -0.38998673
6     G1  0.78676448 -0.70243031
7     G1  0.29611857 -0.51216302
8     G1  1.96831668 -0.07017856
9     G1  0.13034798  1.28344355
10    G1 -0.15531481  0.94086118
11    G2  0.65258740 -0.48107934
12    G2 -1.11294137 -0.51280763
13    G2  1.35929571 -0.85913000
14    G2 -0.36637039 -0.50303582
15    G2 -1.20766391 -0.52910758
16    G2  0.27350136 -0.00188101
17    G2 -1.03189591 -0.11919335
18    G2 -0.11188425 -1.42868344
19    G2  0.05789754 -1.66900549
20    G2 -1.16903207 -0.17194032

#OUTPUT (by method)
> by(df, df$group, FUN = function(X) cor(X$var1, X$var2, method = "spearman"))
df$group: G1
[1] 0.1515152
------------------------------------------------------------
df$group: G2
[1] -0.1151515

#OUTPUT (lapply & split method)
> j <- lapply(split(df, df$group), function(x){cor(x[,2], x[,3], method = "spearman")})
> data.frame(group = names(j), corr = unlist(j), row.names = NULL)
  group       corr
1    G1  0.1515152
2    G2 -0.1151515


Unlist & Recursively Unlist sunlist


A <- data.frame(a = c(1:10), b = c(11:20))
B <- data.frame(a = c(101:110), b = c(111:120))
C <- data.frame(a = c(5:8), b = c(55:58))
L <- list(list(B, C), list(A), list(C, A), list(A, B, C), list(C))
unlist(L)                 #unlist everything into one vector
unlist(L, recursive=F)    #unlist everything into one list of many vectors

Access Elements in a List


mod <- summary(lm(cyl ~ mpg, data = mtcars))
mod[[4]][[2]]
mod[[c(4, 2)]]   #uses the vector to recursively access lists
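The same recursive indexing works with names too; a minimal sketch on a made-up nested list:

```r
L <- list(a = list(b = list(c = 99)))
L[["a"]][["b"]][["c"]]   #[1] 99
L[[c("a", "b", "c")]]    #[1] 99 -- one recursive index instead of three
```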

Repeat rows (1 & 2) by the times given in a third column [good for table() summarised data frames]
see dfpeat() in .Rprofile

dataframe[ rep( seq(dim(dataframe)[1]), 3rd column), -1]

EXAMPLE
DF <- structure(list(num = c(5, 5, 4), freq = c(96, 60, 59), rank = c(1, 2, 3)),
    .Names = c("num", "freq", "rank"), row.names = c(NA, 3L), class = "data.frame")

  num freq rank
1   5   96    1
2   5   60    2
3   4   59    3

DF2 <- DF[ rep( seq(dim(DF)[1]), DF$num), -1]
rownames(DF2) <- 1:nrow(DF2)
DF2

> DF2
   freq rank
1    96    1
2    96    1
3    96    1
4    96    1
5    96    1
6    60    2
7    60    2
8    60    2
9    60    2
10   60    2
11   59    3
12   59    3
13   59    3
14   59    3


Turn a list into a dataframe 3 ways


> do.call(rbind, j)
           [,1]     [,2]      [,3]      [,4]
 [1,] 1.6411064 1.157174  0.873377 0.3134954
 [2,] 0.9041039 2.667465  1.965937 0.4181302
 [3,] 4.4037940 2.420527  3.264888 3.8311805
 [4,] 3.9637209 5.402170  5.196343 4.6943378
 [5,] 3.9358796 5.866777  5.540184 4.2303664
 [6,] 5.8809682 4.669888  4.773183 6.8188467
 [7,] 8.3059954 6.389316  5.942269 7.4630666
 [8,] 7.5501919 7.807572  7.373059 7.5226562
 [9,] 8.6035129 7.044928  9.074038 8.0470154
[10,] 9.3076546 8.424741 11.628522 9.7019016

> ldply(j, I)
          V1       V2        V3        V4
1  1.6411064 1.157174  0.873377 0.3134954
2  0.9041039 2.667465  1.965937 0.4181302
3  4.4037940 2.420527  3.264888 3.8311805
4  3.9637209 5.402170  5.196343 4.6943378
5  3.9358796 5.866777  5.540184 4.2303664
6  5.8809682 4.669888  4.773183 6.8188467
7  8.3059954 6.389316  5.942269 7.4630666
8  7.5501919 7.807572  7.373059 7.5226562
9  8.6035129 7.044928  9.074038 8.0470154
10 9.3076546 8.424741 11.628522 9.7019016

> ldply(j, function(x){x})
          V1       V2        V3        V4
1  1.6411064 1.157174  0.873377 0.3134954
2  0.9041039 2.667465  1.965937 0.4181302
3  4.4037940 2.420527  3.264888 3.8311805
4  3.9637209 5.402170  5.196343 4.6943378
5  3.9358796 5.866777  5.540184 4.2303664
6  5.8809682 4.669888  4.773183 6.8188467
7  8.3059954 6.389316  5.942269 7.4630666
8  7.5501919 7.807572  7.373059 7.5226562
9  8.6035129 7.044928  9.074038 8.0470154
10 9.3076546 8.424741 11.628522 9.7019016

j <- lapply(1:10, rnorm, n=4)
#METHOD 1
do.call(rbind, j)
#or
data.frame(do.call(rbind, j))
#METHOD 2
library(plyr)
ldply(j, I)
#METHOD 3
ldply(j, function(x){x})


EVAL/PARSE
a <- 3
x <- "a > 2"
eval(parse(text=x))

x2 <- "a==3"
eval(parse(text=x2))

a <- 1:13
x <- "mean(a)"
eval(parse(text=x))
## > a <- 3
## > x <- "a > 2"
## > eval(parse(text=x))
## [1] TRUE
##
## > x2 <- "a==3"
## > eval(parse(text=x2))
## [1] TRUE
##
## > a <- 1:13
## > x <- "mean(a)"
## > eval(parse(text=x))
## [1] 7


RESHAPE
http://had.co.nz/stat405/lectures/19-tables.pdf

Data Set to Long Format for Repeated Measures
library(reshape)

melt(data.frame, id = variables/columns to group by)
cast(molten data.frame, formula, variable or value, aggregate.function)
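A minimal melt/cast round trip on a tiny made-up data frame (assumes the reshape package is installed):

```r
library(reshape)
# wide: one row per subject, one column per time point
wide <- data.frame(id = c("s1", "s2"), t1 = c(5, 7), t2 = c(6, 9))
long <- melt(wide, id = "id")   # long: columns id / variable (t1, t2) / value
long
cast(long, id ~ variable)       # back to the original wide layout
```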
#Example 1:
source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Data Sets.txt")
library(reshape)
rep.mes2 <- rep.mes
Sex <- gl(2, 25, length = 50, labels = c("Male", "Female"))
rep.mes2 <- data.frame(rep.mes2[1:2], Sex, rep.mes2[3:5])
long.rep.mes <- melt(rep.mes2, id = 1:3)[order(melt(rep.mes)$Sub), ]
rownames(long.rep.mes) <- 1:150
rep.mes2; long.rep.mes
# Example 2:
d <- ascii("Code Country     1950   1951   1952   1953   1954
AFG  Afghanistan 20,249 21,352 22,532 23,557 24,555
ALB  Albania      8,097  8,986 10,058 11,123 12,246")
d
#Method 1
x1 <- reshape(d, direction="long", varying=list(names(d)[3:7]), v.names="Value",
    idvar=c("Code","Country"), timevar="Year", times=1950:1954)
rownames(x1) <- 1:nrow(x1)
x1
#Method 2 PREFERED
library(reshape)
x2 <- melt(d, id=c("Code","Country"), variable_name="Year")
x2[,"Year"] <- as.numeric(gsub("X","",x2[,"Year"]))
x2
cast(x2, Year~Country)
cast(x2, Country~Year)
cast(x2, Country + Code~Year)

#EXAMPLE
  Code     Country  X1950  X1951  X1952  X1953  X1954
1  AFG Afghanistan 20,249 21,352 22,532 23,557 24,555
2  ALB     Albania  8,097  8,986 10,058 11,123 12,246

> x2 #MELTED
   Code     Country Year  value
1   AFG Afghanistan 1950 20,249
2   ALB     Albania 1950  8,097
3   AFG Afghanistan 1951 21,352
4   ALB     Albania 1951  8,986
5   AFG Afghanistan 1952 22,532
6   ALB     Albania 1952 10,058
7   AFG Afghanistan 1953 23,557
8   ALB     Albania 1953 11,123
9   AFG Afghanistan 1954 24,555
10  ALB     Albania 1954 12,246


#RECASTED
> cast(x2, Year~Country)
  Year Afghanistan Albania
1 1950      20,249   8,097
2 1951      21,352   8,986
3 1952      22,532  10,058
4 1953      23,557  11,123
5 1954      24,555  12,246
> cast(x2, Country~Year)
      Country   1950   1951   1952   1953   1954
1 Afghanistan 20,249 21,352 22,532 23,557 24,555
2     Albania  8,097  8,986 10,058 11,123 12,246
> cast(x2, Country + Code~Year)
      Country Code   1950   1951   1952   1953   1954
1 Afghanistan  AFG 20,249 21,352 22,532 23,557 24,555
2     Albania  ALB  8,097  8,986 10,058 11,123 12,246

library('reshape')
DF <- data.frame("TAX" = c("A", "A", "A", "A", "B", "B", "B", "B"),
    "YEAR" = c(2000, 2001, 2002, 2003, 2000, 2001, 2002, 2004),
    "NUMBER" = c(2, 2, 3, 1, 3, 4, 3, 2))
DF
cast(DF, YEAR ~ TAX, value = 'NUMBER', fill = 0)
DF2 <- data.frame(DF, "NEW" = rnorm(nrow(DF)))
cast(DF2, YEAR + NEW ~ TAX, value = 'NUMBER', fill = 0)
cast(DF2, TAX ~ YEAR, value = 'NUMBER')
cast(DF2, TAX ~ NUMBER, value = 'NEW', mean)
cast(DF2, TAX ~ NUMBER, value = 'NEW', mean, fill = NA)
cast(DF2, TAX ~ NUMBER, value = 'NEW', sum)
cast(DF2, TAX + YEAR ~ NUMBER, value = 'NEW', sum)

> DF
  TAX YEAR NUMBER
1   A 2000      2
2   A 2001      2
3   A 2002      3
4   A 2003      1
5   B 2000      3
6   B 2001      4
7   B 2002      3
8   B 2004      2

> cast(DF, YEAR ~ TAX, value = 'NUMBER', fill = 0)
  YEAR A B
1 2000 2 3
2 2001 2 4
3 2002 3 3
4 2003 1 0
5 2004 0 2

> cast(DF2, YEAR + NEW ~ TAX, value = 'NUMBER', fill = 0)
  YEAR         NEW A B
1 2000 -1.77380068 2 0
2 2000  0.46681003 0 3
3 2001 -0.09072904 0 4
4 2001  2.19618765 2 0
5 2002 -1.68538164 0 3
6 2002  0.85410280 3 0
7 2003  0.13744107 1 0
8 2004  0.12992724 0 2

> cast(DF2, TAX ~ YEAR, value = 'NUMBER')
  TAX 2000 2001 2002 2003 2004
1   A    2    2    3    1   NA
2   B    3    4    3   NA    2

> cast(DF2, TAX ~ NUMBER, value = 'NEW', mean)
  TAX         1         2          3           4
1   A 0.1374411 0.2111935  0.8541028         NaN
2   B       NaN 0.1299272 -0.6092858 -0.09072904

> cast(DF2, TAX ~ NUMBER, value = 'NEW', mean, fill=NA)
  TAX         1         2          3           4
1   A 0.1374411 0.2111935  0.8541028          NA
2   B        NA 0.1299272 -0.6092858 -0.09072904

> cast(DF2, TAX ~ NUMBER, value = 'NEW', sum)
  TAX         1         2          3           4
1   A 0.1374411 0.4223870  0.8541028  0.00000000
2   B 0.0000000 0.1299272 -1.2185716 -0.09072904

> cast(DF2, TAX + YEAR ~ NUMBER, value = 'NEW', sum)
  TAX YEAR         1          2          3           4
1   A 2000 0.0000000 -1.7738007  0.0000000  0.00000000
2   A 2001 0.0000000  2.1961876  0.0000000  0.00000000
3   A 2002 0.0000000  0.0000000  0.8541028  0.00000000
4   A 2003 0.1374411  0.0000000  0.0000000  0.00000000
5   B 2000 0.0000000  0.0000000  0.4668100  0.00000000
6   B 2001 0.0000000  0.0000000  0.0000000 -0.09072904
7   B 2002 0.0000000  0.0000000 -1.6853816  0.00000000
8   B 2004 0.0000000  0.1299272  0.0000000  0.00000000

DF3 <- melt(DF2, id = c("TAX", "YEAR"), na.rm = TRUE)
cast(DF3, TAX ~ . | variable, mean)
cast(DF3, TAX ~ . | variable, sum)
cast(DF3, TAX ~ . | variable, range)
#or even better
cast(DF3, TAX ~ . | variable, c(min, max))
cast(DF3, YEAR + TAX ~ . | variable)
recast(DF2, YEAR + TAX ~ . | variable, id.var = c("TAX", "YEAR"),
    measure.var = c("NUMBER", "NEW"))
recast(DF2, YEAR + TAX + NUMBER ~ . | variable, id.var = c("TAX", "YEAR", "NUMBER"),
    measure.var = c("NEW"), fun.aggregate = range)

> cast(DF3, TAX ~ . | variable, c(min, max))
$NUMBER
  TAX min max
1   A   1   3
2   B   2   4

$NEW
  TAX       min      max
1   A -2.061310 1.281748
2   B -1.726776 1.986024

> cast(DF3, YEAR + TAX ~ . | variable)
$NUMBER
  YEAR TAX (all)
1 2000   A     2
2 2000   B     3
3 2001   A     2
4 2001   B     4
5 2002   A     3
6 2002   B     3
7 2003   A     1
8 2004   B     2

$NEW
  YEAR TAX       (all)
1 2000   A -2.06131000
2 2000   B  1.98602360
3 2001   A  1.10881310
4 2001   B -0.89410042
5 2002   A  1.28174758
6 2002   B -1.72677556
7 2003   A  0.05761605
8 2004   B -0.15146665


test <- dput(structure(list(Parm = structure(c(2L, 1L, 4L, 3L, 5L, 2L, 1L, 4L, 3L, 5L),
    .Label = c("day", "hour", "max", "min", "outlier"), class = "factor"),
    values = c(0, 0, 5, 7, 0.25, 1, 0, 5, 7, 0.25),
    person = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L)),
    .Names = c("Parm", "values", "person"), class = "data.frame",
    row.names = c(NA, -10L)))
library(reshape)
cast(test, person ~ Parm, value = "values")

> test
      Parm values person
1     hour   0.00      1
2      day   0.00      1
3      min   5.00      1
4      max   7.00      1
5  outlier   0.25      1
6     hour   1.00      2
7      day   0.00      2
8      min   5.00      2
9      max   7.00      2
10 outlier   0.25      2

  person day hour max min outlier
1      1   0    0   7   5    0.25
2      2   0    1   7   5    0.25


Differences between a value by group


#Create the data set
set.seed(1234)
x <- data.frame(id=1:5, value=sample(20:30, 5, replace=T), year=3)
y <- data.frame(id=1:5, value=sample(10:19, 5, replace=T), year=2)
z <- data.frame(id=1:5, value=sample(0:9, 5, replace=T), year=1)
df <- df.reset <- rbind(x, y, z)   #The reset allows us to reset df each time

#SAPPLY
df <- df[order(df$id, df$year), ]
sdf <- split(df, df$id)
df$actual <- c(sapply(seq_along(sdf), function(x) diff(c(0, sdf[[x]][,2]))))
df[order(as.numeric(rownames(df))),]

#AGGREGATE
df <- df.reset
df <- df[order(df$id, df$year), ]
diff2 <- function(x) diff(c(0, x))
df$actual <- c(unlist(t(aggregate(value~id, df, diff2)[, -1])))
df[order(as.numeric(rownames(df))),]

#BY
df <- df.reset
df <- df[order(df$id, df$year), ]
df$actual <- unlist(by(df$value, df$id, diff2))
df[order(as.numeric(rownames(df))),]

#PLYR
df <- df.reset
df <- df[order(df$id, df$year), ]
df <- data.frame(temp=1:nrow(df), df)
library(plyr)
df <- ddply(df, .(id), transform, actual=diff2(value))
df[order(df$year, df$temp),][, -1]

   id value year
1   1    21    3
2   2    26    3
3   3    26    3
4   4    26    3
5   5    29    3
6   1    16    2
7   2    10    2
8   3    12    2
9   4    16    2
10  5    15    2
11  1     6    1
12  2     5    1
13  3     2    1
14  4     9    1
15  5     2    1

   id value year actual
1   1    21    3      5
2   2    26    3     16
3   3    26    3     14
4   4    26    3     10
5   5    29    3     14
6   1    16    2     10
7   2    10    2      5
8   3    12    2     10
9   4    16    2      7
10  5    15    2     13
11  1     6    1      6
12  2     5    1      5
13  3     2    1      2
14  4     9    1      9
15  5     2    1      2

Extending this to multiple columns


#Create the data set
set.seed(1234)
x <- data.frame(id=1:5, value=sample(20:30, 5, replace=T), year=3)
y <- data.frame(id=1:5, value=sample(10:19, 5, replace=T), year=2)
z <- data.frame(id=1:5, value=sample(0:9, 5, replace=T), year=1)
df <- rbind(x, y, z)
df <- df.reset <- data.frame(df[, 1:2],
    new.var = df[, 2] + sample(1:5, nrow(df), replace=T), year = df[, 3])

#SAPPLY the BY
df <- df[order(df$id, df$year), ]
diff2 <- function(x) diff(c(0, x))
group.diff <- function(x) unlist(by(x, df$id, diff2))
df <- data.frame(df, sapply(df[, 2:3], group.diff))
df <- df[order(as.numeric(rownames(df))),]
names(df)[5:6] <- c('actual', 'actual.new'); df

#TRANSFORM the BY
df <- df.reset
df <- df[order(df$id, df$year), ]
diff2 <- function(x) diff(c(0, x))
group.diff <- function(x) unlist(by(x, df$id, diff2))
df <- transform(df, actual=group.diff(value), actual.new=group.diff(new.var))
df[order(as.numeric(rownames(df))),]

#PLYR
df <- df.reset
df <- data.frame(temp=1:nrow(df), df)
df <- df[order(df$id, df$year), ]
library(plyr)
df <- ddply(df, .(id), transform, actual=diff2(value), actual.new=diff2(new.var))
df[order(df$temp),][, -1]


Extract just the rows of a dataframe from the max of 1 variable


   ID week outcome
1   1    2      14
2   1    4      28
3   1    6      42
4   4    2      14
5   4    6      46
6   4    9      64
7   4    9      71
8   4   12      85
9   9    2      14
10  9    4      28
11  9    6      51
12  9    9      66
13  9   12      84

We want to select the max week for each individual but return the rest of the data frame.

  ID week outcome
1  1    6      42
4  4   12      85
9  9   12      84

#THE DATA
df <- structure(list(ID = c(1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 9L, 9L, 9L, 9L, 9L),
    week = c(2L, 4L, 6L, 2L, 6L, 9L, 9L, 12L, 2L, 4L, 6L, 9L, 12L),
    outcome = c(14L, 28L, 42L, 14L, 46L, 64L, 71L, 85L, 14L, 28L, 51L, 66L, 84L)),
    .Names = c("ID", "week", "outcome"), class = "data.frame",
    row.names = c(NA, -13L))

#METHOD 1
do.call("rbind", by(df, INDICES=df$ID, FUN=function(DF) DF[which.max(DF$week),]))

#METHOD 2
library(data.table)
dt <- data.table(df, key="ID")
dt[, .SD[which.max(outcome),], by=ID]

#METHOD 3
library(plyr)
ddply(df, .(ID), function(X) X[which.max(X$week), ])

#METHOD 4
sdf <- with(df, split(df, ID))
max.week <- sapply(seq_along(sdf), function(x) which.max(sdf[[x]][, 'week']))
data.frame(t(mapply(function(x, y) y[x, ], max.week, sdf)))

#METHOD 5
df[rev(rownames(df)),][match(unique(df$ID), rev(df$ID)), ]

#METHOD 6
sdf <- with(df, split(df, ID))
df[cumsum(sapply(seq_along(sdf), function(x) which.max(sdf[[x]][, 'week']))), ]

#METHOD 7
df[cumsum(aggregate(week ~ ID, df, which.max)$week), ]

#METHOD 8
df[tapply(seq_along(df$ID), df$ID, function(x){tail(x,1)}),]

#METHOD 9
df[cumsum(as.numeric(lapply(split(df$week, df$ID), which.max))), ]

See the rbenchmark results below:


library(rbenchmark)
benchmark(
    DATA.TABLE = {dt <- data.table(df, key="ID")
        dt[, .SD[which.max(outcome),], by=ID]},
    DO.CALL = {do.call("rbind", by(df, INDICES=df$ID,
        FUN=function(DF) DF[which.max(DF$week),]))},
    PLYR = ddply(df, .(ID), function(X) X[which.max(X$week), ]),
    SPLIT = {sdf <- with(df, split(df, ID))
        max.week <- sapply(seq_along(sdf), function(x) which.max(sdf[[x]][, 'week']))
        data.frame(t(mapply(function(x, y) y[x, ], max.week, sdf)))},
    MATCH.INDEX = df[rev(rownames(df)),][match(unique(df$ID), rev(df$ID)), ],
    AGGREGATE = df[cumsum(aggregate(week ~ ID, df, which.max)$week), ],
    BRYANS.INDEX = df[cumsum(as.numeric(lapply(split(df$week, df$ID), which.max))), ],
    SPLIT2 = {sdf <- with(df, split(df, ID))
        df[cumsum(sapply(seq_along(sdf), function(x) which.max(sdf[[x]][, 'week']))), ]},
    TAPPLY = df[tapply(seq_along(df$ID), df$ID, function(x){tail(x,1)}),],
    columns = c("test", "replications", "elapsed", "relative", "user.self", "sys.self"),
    order = "test", replications = 1000, environment = parent.frame())

          test replications elapsed  relative user.self sys.self
6    AGGREGATE         1000    4.49  7.610169      2.84     0.05
7 BRYANS.INDEX         1000    0.59  1.000000      0.20     0.00
1   DATA.TABLE         1000   20.28 34.372881     11.98     0.00
2      DO.CALL         1000    4.67  7.915254      2.95     0.03
5  MATCH.INDEX         1000    1.07  1.813559      0.51     0.00
3         PLYR         1000   10.61 17.983051      5.07     0.00
4        SPLIT         1000    3.12  5.288136      1.81     0.00
8       SPLIT2         1000    1.56  2.644068      1.28     0.00
9       TAPPLY         1000    1.08  1.830508      0.88     0.00


Summarize (apply a function to) a numeric variable by 2 categorical variables


#================
#Make a data set
#================
n <- 100
dat <- data.frame(
    Accuracy = round(runif(n, 0, 5), 1),
    Month = sample(1:2, n, replace=TRUE),
    Day = sample(1:5, n, replace=TRUE),
    Easting = rnorm(n),
    Northing = rnorm(n),
    Etc = rnorm(n)
)

#==========
#using plyr
#==========
library(plyr)
ddply(dat, c("Month", "Day"), function(x) x[which.min(x$Accuracy), ])

#==========
#using base
#==========
t(sapply(split(dat, list(dat$Month, dat$Day)),
    function(d) d[which.min(d$Accuracy), ]))

#You can get quasi there with (but not all the data frame will come along):
aggregate(Accuracy ~ Month + Day, data = dat, FUN = min)

#OUTCOME (find min value by month and day)
   Accuracy Month Day    Easting    Northing        Etc
1       1.0     1   1 -1.2107186 -0.06473102  1.5195738
2       0.7     1   2  0.7552501  1.20389863  0.1319931
3       0.5     1   3  1.1104158 -0.31173230 -0.4738744
4       0.5     1   4 -0.7936402  0.94957122 -0.5173246
5       0.4     1   5  0.1725260  2.50637015  0.5808553
6       0.1     2   1  1.1359366  1.73373416  1.1122071
7       0.3     2   2  0.9101894  0.57581224  0.2726678
8       0.2     2   3 -0.2905642  0.67290842  1.7687111
9       0.7     2   4 -2.2955213  0.23270159  1.2040872
10      0.0     2   5  1.1167519  1.04612217 -0.7811158


Apply multiple functions to multiple outcomes by multiple groups


> head(CO2)
  Plant   Type  Treatment conc uptake
1   Qn1 Quebec nonchilled   95   16.0
2   Qn1 Quebec nonchilled  175   30.4
3   Qn1 Quebec nonchilled  250   34.8
4   Qn1 Quebec nonchilled  350   37.2
5   Qn1 Quebec nonchilled  500   35.3
6   Qn1 Quebec nonchilled  675   39.2

aggregate(cbind(conc, uptake) ~ Plant + Type + Treatment, data=CO2, FUN=mean)

   Plant        Type  Treatment conc   uptake
1    Qn1      Quebec nonchilled  435 33.22857
2    Qn2      Quebec nonchilled  435 35.15714
3    Qn3      Quebec nonchilled  435 37.61429
4    Mn3 Mississippi nonchilled  435 24.11429
5    Mn2 Mississippi nonchilled  435 27.34286
6    Mn1 Mississippi nonchilled  435 26.40000
7    Qc1      Quebec    chilled  435 29.97143
8    Qc3      Quebec    chilled  435 32.58571
9    Qc2      Quebec    chilled  435 32.70000
10   Mc2 Mississippi    chilled  435 12.14286
11   Mc3 Mississippi    chilled  435 17.30000
12   Mc1 Mississippi    chilled  435 18.00000

SUM <- function(x) c(mean=mean(x), sd=sd(x), n=length(x))
aggregate(cbind(conc, uptake) ~ Plant + Type + Treatment, data=CO2, FUN=SUM)

   Plant        Type  Treatment conc.mean  conc.sd conc.n uptake.mean uptake.sd uptake.n
1    Qn1      Quebec nonchilled  435.0000 317.7263      7   33.228571  8.214766        7
2    Qn2      Quebec nonchilled  435.0000 317.7263      7   35.157143 11.004069        7
3    Qn3      Quebec nonchilled  435.0000 317.7263      7   37.614286 10.349948        7
4    Mn3 Mississippi nonchilled  435.0000 317.7263      7   24.114286  6.484707        7
5    Mn2 Mississippi nonchilled  435.0000 317.7263      7   27.342857  7.652855        7
6    Mn1 Mississippi nonchilled  435.0000 317.7263      7   26.400000  8.694251        7
7    Qc1      Quebec    chilled  435.0000 317.7263      7   29.971429  8.334609        7
8    Qc3      Quebec    chilled  435.0000 317.7263      7   32.585714 10.321083        7
9    Qc2      Quebec    chilled  435.0000 317.7263      7   32.700000 11.336960        7
10   Mc2 Mississippi    chilled  435.0000 317.7263      7   12.142857  2.186974        7
11   Mc3 Mississippi    chilled  435.0000 317.7263      7   17.300000  3.049044        7
12   Mc1 Mississippi    chilled  435.0000 317.7263      7   18.000000  4.118657        7


Table of Means
dat <- structure(list(partic = c(4.875, 3.375, 4.5, 2.875, 4, 4.625, 4.375, 4, 4.375, 3.625, 3.25, 4.875, 4.625, 4.875, 4.125, 3.25, 2.5, 3.875, 3.75, 3.625, 3.375, 4.75, 4.75, 3.57142857142857, 2.5, 4.125, 3.5, 3.375, 3.5, 4.5, 4.375, 3.66666666666667, 1.5, 4.375, 3.875, 4.375, 3.14285714285714, 3.875, 3.875, 3.125, 3.25, 2.375, 2.5, 3.5, 4.25, 4.25, 3.5, 3.625, 3.5, 3.625, 3.75, 3.625, 3.625, 4.25, 4, 4, 3.75, 3.875, 3.5, 4.375, 4, 3.5, 3.75, 3.375, 4.375, 3.875, 1.75, 4.5, 3.75, 3.625, 4, 4, 3.875, 2.75, 3.625, 3.5, 4.5, 4.125, 4.125, 4.625, 3.125, 4.625, 3.875, 3, 4.5, 4.25, 4.375, 4.25, 3.625, 3.5, 2.5, 2.875, 2.875, 2.5, 3.75, 4, 2.875, 2.375, 4.125, 4.5), grade = structure(c(3L, 4L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 3L, 1L, 4L, 3L, 4L, 2L, 2L, 3L, 4L, 4L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 4L, 4L, 4L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 4L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 2L, 2L, 2L), .Label = c("freshman", "sophomore", "junior", "senior"), class = "factor"), race3 = structure(c(1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 3L, 2L, 3L, 2L, 3L, 2L, 2L, 3L, 2L, 2L, 1L, 3L, 1L, 3L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 3L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 3L, 1L, 2L, 1L, 3L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 1L, 2L, 2L, 2L, 2L, 2L, 1L), .Label = c("white/asian", "black", "hispanic"), class = "factor")), .Names = c("partic", "grade", "race3"), na.action = structure(c(11L, 42L, 154L, 210L, 230L, 282L, 306L, 336L, 349L, 352L, 377L, 378L, 397L, 437L, 477L ), .Names = c("11", "42", "154", "210", "230", "282", "306", "336", "349", "352", "377", "378", "397", "437", "477"), class = "omit"), row.names = c(NA, 100L), class = "data.frame") 
#====================================================================
DF.rs <- melt(dat, id=c("grade", "race3"))
MT <- function(x){
    paste(round(mean(x), digits=2), "(", round(sd(x), digits=2), ")", sep="")
}
cast(DF.rs, grade ~ race3, fun.aggregate=MT,
    margins=c("grand_row", "grand_col"))

library(reshape)
library(plyr)
dat <- read.table(text="
request user group
1 1 1
4 1 1
7 1 1
5 1 2
8 1 2
1 2 3
4 2 3
7 2 3
9 2 4
", header=TRUE)

newdat <- ddply(dat, .(user, group), transform,
    idx = paste("request", 1:length(request), sep = ""))

> cast(newdat, user + group ~ idx, value = .(request))
  user group request1 request2 request3
1    1     1        1        4        7
2    1     2        5        8       NA
3    2     3        1        4        7
4    2     4        9       NA       NA



names(airquality) <- tolower(names(airquality))
aqm <- melt(airquality, id=c("month", "day"), na.rm=TRUE)

cast(aqm, day ~ month ~ variable)
cast(aqm, month ~ variable, mean)
cast(aqm, month ~ . | variable, mean)
cast(aqm, month ~ variable, mean, margins=c("grand_row", "grand_col"))
cast(aqm, day ~ month, mean, subset=variable=="ozone")
cast(aqm, month ~ variable, range)
cast(aqm, month ~ variable + result_variable, range)
cast(aqm, variable ~ month ~ result_variable, range)


#Chick weight example
names(ChickWeight) <- tolower(names(ChickWeight))
chick_m <- melt(ChickWeight, id=2:4, na.rm=TRUE)
cast(chick_m, time ~ variable, mean)        # average effect of time
cast(chick_m, diet ~ variable, mean)        # average effect of diet
cast(chick_m, diet ~ time ~ variable, mean) # average effect of diet & time

# How many chicks at each time? - checking for balance
cast(chick_m, time ~ diet, length)
cast(chick_m, chick ~ time, mean)
cast(chick_m, chick ~ time, mean, subset=time < 10 & chick < 20)
cast(chick_m, diet + chick ~ time)
cast(chick_m, chick ~ time ~ diet)
cast(chick_m, diet + chick ~ time, mean, margins="diet")

#Tips example
cast(melt(tips), sex ~ smoker, mean, subset=variable=="total_bill")
cast(melt(tips), sex ~ smoker | variable, mean)

ff_d <- melt(french_fries, id=1:4, na.rm=TRUE)
cast(ff_d, subject ~ time, length)
cast(ff_d, subject ~ time, length, fill=0)
cast(ff_d, subject ~ time, function(x) 30 - length(x))
cast(ff_d, subject ~ time, function(x) 30 - length(x), fill=30)
cast(ff_d, variable ~ ., c(min, max))
cast(ff_d, variable ~ ., function(x) quantile(x, c(0.25, 0.5)))
cast(ff_d, treatment ~ variable, mean, margins=c("grand_col", "grand_row"))
cast(ff_d, treatment + subject ~ variable, mean, margins="treatment")


From long to wide format

reshape(data, varying = NULL, v.names = NULL, timevar = "time",
    idvar = "id", ids = 1:NROW(data), times = seq_along(varying[[1]]),
    drop = NULL, direction, new.row.names = NULL)

Arguments
data           a data frame
varying        names of sets of variables in the wide format that correspond to
               single variables in long format (time-varying). This is canonically
               a list of vectors of variable names, but it can optionally be a
               matrix of names, or a single vector of names. In each case, the
               names can be replaced by indices which are interpreted as referring
               to names(data). See below for more details and options.
v.names        names of variables in the long format that correspond to multiple
               variables in the wide format. See below for details.
timevar        the variable in long format that differentiates multiple records
               from the same group or individual.
idvar          names of one or more variables in long format that identify
               multiple records from the same group/individual. These variables
               may also be present in wide format.
ids            the values to use for a newly created idvar variable in long format.
times          the values to use for a newly created timevar variable in long
               format. See below for details.
drop           a vector of names of variables to drop before reshaping.
direction      character string, either "wide" to reshape to wide format, or
               "long" to reshape to long format.
new.row.names  logical; if TRUE and direction="long", create new row names in
               long format from the values of the id and time variables.
DF <- data.frame(id=rep(paste("subject", 1:5, sep=" "), 3),
    sex=rep(c("m","m","m","f","f"), 3),
    time=c(rep("Time1",5), rep("Time2",5), rep("Time3",5)),
    score1=rnorm(15), score2=abs(rnorm(15)*4))

wide <- reshape(DF, v.names=c("score1", "score2"), idvar="id",
    timevar="time", direction="wide")
wide

long <- with(wide, reshape(wide, idvar="id",
    v.names=c("score1", "score2"), direction="long"))
rownames(long) <- 1:nrow(long)
long

#USING RESHAPE

DF <- data.frame(id=rep(paste("subject", 1:5, sep=" "), 3),
    sex=rep(c("m","m","m","f","f"), 3),
    time=c(rep("Time1",5), rep("Time2",5), rep("Time3",5)),
    score1=rnorm(15), score2=abs(rnorm(15)*4))

library(reshape)
m <- melt(DF)
cast(m, id + sex ~ ...)
cast(m, id + sex ~ variable + time)


Long to wide example (base's reshape and reshape package)


dat <- data.frame(county = rep(letters[1:4], each=2),
    state = rep(LETTERS[1], times=8),
    industry = rep(c("construction", "manufacturing"), 4),
    employment = round(rnorm(8, 100, 50), 0),
    establishments = round(rnorm(8, 20, 5), 0))
  state county construction_employment construction_establishments manufacturing_employment manufacturing_establishments
1     A      a                     100                          24                      159                           26
2     A      b                     117                          17                       64                           25
3     A      c                      85                          23                       50                           19
4     A      d                      21                          14                       48                            8

#Method 1 (base)
reshape(dat, direction="wide", idvar=c("state", "county"),
    timevar="industry")

#Method 2 (reshape2 package)
library(reshape2)
m <- melt(dat)
dcast(m, state + county ~ ...)

  county state      industry employment establishments
1      a      A  construction        100             24
2      a      A manufacturing        159             26
3      b      A  construction        117             17
4      b      A manufacturing         64             25
5      c      A  construction         85             23
6      c      A manufacturing         50             19
7      d      A  construction         21             14
8      d      A manufacturing         48              8


Long to wide with base reshape explained:


df3 <- data.frame(school = rep(1:3, each = 4), class = rep(9:10, 6),
    time = rep(c(1,1,2,2), 3), score = rnorm(12))
df3
wide <- reshape(df3, idvar = c("school","class"), direction = "wide")
wide

DF <- data.frame(id=rep(paste("subject", 1:5, sep=" "), 3),
    sex=rep(c("m","m","m","f","f"), 3),
    time=c(rep("Time1",5), rep("Time2",5), rep("Time3",5)),
    score1=rnorm(15), score2=abs(rnorm(15)*4))
DF
wide <- reshape(DF, v.names=c("score1", "score2"), idvar="id",
    timevar="time", direction="wide")

DF2 <- expand.grid(market = LETTERS[1:5], date = Sys.Date()+(0:5),
    sitename = letters[1:2])
DF2$impression <- sample(100, nrow(DF2), replace=TRUE)
DF2$clicks <- sample(100, nrow(DF2), replace=TRUE)
DF2
wide <- reshape(DF2, v.names=c("impression", "clicks"),
    idvar=c("market", "date"), timevar="sitename", direction="wide")
wide

What's going on with reshape when it's long to wide:

timevar  these are the repeated measures; that may be times or locations etc.
         [categorical]
v.names  the repeated measures measurement (in both these cases we have two
         different variables being measured over two different times) [numeric]
idvar    these are the variables we want to replicate and unstack to match to
         the timevar and idvar

Basically, worry about what your repeated measures variable is (timevar). This is not numeric but categorical. Then enter your actual measures taken at each repeated measure (v.names). This is usually numeric (however, it could be categorical). Generally, everything remaining is an id variable.
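A minimal sketch of those three roles (the data frame and column names here are made up for illustration):

```r
# Hypothetical wide data: one row per subject, a score at two occasions
wdat <- data.frame(id = 1:3, sex = c("m", "f", "m"),
                   score.1 = c(10, 12, 9), score.2 = c(14, 15, 11))

# timevar is created in long format from the repeated occasions,
# v.names is the measurement taken at each occasion,
# idvar identifies the subject (sex tags along as an id-level column)
ldat <- reshape(wdat, direction = "long",
                varying = list(c("score.1", "score.2")), v.names = "score",
                timevar = "time", times = 1:2, idvar = "id")
ldat  # 6 rows: each id appears once per time
```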


Wide to Long with > 2 Measures Per Time


FROM

  id x1 x2 x3 y1 y2 y3  z1  z2  z3   v
1  1  2  4  5 10 20 15 200 150 170 2.5
2  2  3  7  6 25 35 40 300 350 400 4.2

TO

  id xsource x  y   v
1  1      x1 2 10 2.5
2  1      x2 4 20 2.5
3  1      x3 5 15 2.5
4  2      x1 3 25 4.2
5  2      x2 7 35 4.2
6  2      x3 6 40 4.2

CODE
x <- read.table(text="
id x1 x2 x3 y1 y2 y3  z1  z2  z3   v
1   2  4  5 10 20 15 200 150 170 2.5
2   3  7  6 25 35 40 300 350 400 4.2
", header=TRUE)
x

#===============================================================
#METHOD #1
res <- reshape(x, direction = "long", idvar = "id",
    varying = list(c("x1", "x2", "x3"), c("y1", "y2", "y3"),
        c("z1", "z2", "z3")),
    v.names = c("x", "y", "z"),
    timevar = "xsource", times = c("x1", "x2", "x3"))
res <- res[order(res$id, res$xsource), c(1,3,4,5,2)]
row.names(res) <- NULL
res

#===============================================================
#METHOD #2
chunks <- lapply(1:nrow(x), function(i)
    cbind(x[i, 1], 1:3, matrix(x[i, 2:10], ncol=3), x[i, 11]))
res <- do.call(rbind, chunks)
colnames(res) <- c("id", "source", "x", "y", "z", "v")
res


Wide to Long with Multiple Measures per Time (stacking and double stacking closely examined)

Original
      id trt    work.T1   play.T1    talk.T1   total.T1   work.T2    play.T2    talk.T2   total.T2
1   x1.1  tr 0.65165567 0.8647212 0.53559704 0.27548386 0.3543281 0.03188816 0.07557029 0.86138244
2   x1.2 cnt 0.56773775 0.6153524 0.09308813 0.22890394 0.9364325 0.11446759 0.53442678 0.46439198
3   x1.3 cnt 0.11350898 0.7751099 0.16980304 0.01443391 0.2458664 0.46893548 0.64135658 0.22286743
4   x1.4  tr 0.59592531 0.3555687 0.89983245 0.72896456 0.4731415 0.39698674 0.52573932 0.62354960
5   x1.5 cnt 0.35804998 0.4058500 0.42263761 0.24988047 0.1915609 0.83361919 0.03928139 0.20364770
6   x1.6 cnt 0.42880942 0.7066469 0.74774647 0.16118328 0.5832220 0.76112174 0.54585984 0.01967341
7   x1.7 cnt 0.05190332 0.8382877 0.82265258 0.01704265 0.4594732 0.57335645 0.37276310 0.79799301
8   x1.8 cnt 0.26417767 0.2395891 0.95465365 0.48610035 0.4674340 0.44750805 0.96130241 0.27431890
9   x1.9  tr 0.39879073 0.7707715 0.68544451 0.10290017 0.3998326 0.08380201 0.25734157 0.16660910
10 x1.10 cnt 0.83613414 0.3558977 0.50050323 0.80154700 0.5052856 0.21913855 0.20795168 0.17015172

Double Stack
      id trt time  type   measures
1   x1.1   tr    1  work 0.65165567
2   x1.2  cnt    1  work 0.56773775
3   x1.3  cnt    1  work 0.11350898
4   x1.4   tr    1  work 0.59592531
5   x1.5  cnt    1  work 0.35804998
6   x1.6  cnt    1  work 0.42880942
7   x1.7  cnt    1  work 0.05190332
8   x1.8  cnt    1  work 0.26417767
9   x1.9   tr    1  work 0.39879073
10 x1.10  cnt    1  work 0.83613414
11  x1.1   tr    2  work 0.35432806
12  x1.2  cnt    2  work 0.93643254
13  x1.3  cnt    2  work 0.24586639
14  x1.4   tr    2  work 0.47314146
15  x1.5  cnt    2  work 0.19156087
16  x1.6  cnt    2  work 0.58322197
17  x1.7  cnt    2  work 0.45947319
18  x1.8  cnt    2  work 0.46743405
19  x1.9   tr    2  work 0.39983256
20 x1.10  cnt    2  work 0.50528560
21  x1.1   tr    1  play 0.86472123
22  x1.2  cnt    1  play 0.61535242
23  x1.3  cnt    1  play 0.77510990
24  x1.4   tr    1  play 0.35556869
25  x1.5  cnt    1  play 0.40584997
26  x1.6  cnt    1  play 0.70664691
27  x1.7  cnt    1  play 0.83828767
28  x1.8  cnt    1  play 0.23958913
29  x1.9   tr    1  play 0.77077153
30 x1.10  cnt    1  play 0.35589774
31  x1.1   tr    2  play 0.03188816
32  x1.2  cnt    2  play 0.11446759
33  x1.3  cnt    2  play 0.46893548
34  x1.4   tr    2  play 0.39698674
35  x1.5  cnt    2  play 0.83361919
36  x1.6  cnt    2  play 0.76112174
37  x1.7  cnt    2  play 0.57335645
38  x1.8  cnt    2  play 0.44750805
39  x1.9   tr    2  play 0.08380201
40 x1.10  cnt    2  play 0.21913855
41  x1.1   tr    1  talk 0.53559704
42  x1.2  cnt    1  talk 0.09308813
43  x1.3  cnt    1  talk 0.16980304
44  x1.4   tr    1  talk 0.89983245
45  x1.5  cnt    1  talk 0.42263761
46  x1.6  cnt    1  talk 0.74774647
47  x1.7  cnt    1  talk 0.82265258
48  x1.8  cnt    1  talk 0.95465365
49  x1.9   tr    1  talk 0.68544451
50 x1.10  cnt    1  talk 0.50050323
51  x1.1   tr    2  talk 0.07557029
52  x1.2  cnt    2  talk 0.53442678
53  x1.3  cnt    2  talk 0.64135658
54  x1.4   tr    2  talk 0.52573932
55  x1.5  cnt    2  talk 0.03928139
56  x1.6  cnt    2  talk 0.54585984
57  x1.7  cnt    2  talk 0.37276310
58  x1.8  cnt    2  talk 0.96130241
59  x1.9   tr    2  talk 0.25734157
60 x1.10  cnt    2  talk 0.20795168
61  x1.1   tr    1 total 0.27548386
62  x1.2  cnt    1 total 0.22890394
63  x1.3  cnt    1 total 0.01443391
64  x1.4   tr    1 total 0.72896456
65  x1.5  cnt    1 total 0.24988047
66  x1.6  cnt    1 total 0.16118328
67  x1.7  cnt    1 total 0.01704265
68  x1.8  cnt    1 total 0.48610035
69  x1.9   tr    1 total 0.10290017
70 x1.10  cnt    1 total 0.80154700
71  x1.1   tr    2 total 0.86138244
72  x1.2  cnt    2 total 0.46439198
73  x1.3  cnt    2 total 0.22286743
74  x1.4   tr    2 total 0.62354960
75  x1.5  cnt    2 total 0.20364770
76  x1.6  cnt    2 total 0.01967341
77  x1.7  cnt    2 total 0.79799301
78  x1.8  cnt    2 total 0.27431890
79  x1.9   tr    2 total 0.16660910
80 x1.10  cnt    2 total 0.17015172

Single Stack
      id trt times       Work       Play       Talk      Total
1   x1.1  tr     1 0.65165567 0.86472123 0.53559704 0.27548386
2   x1.2 cnt     1 0.56773775 0.61535242 0.09308813 0.22890394
3   x1.3 cnt     1 0.11350898 0.77510990 0.16980304 0.01443391
4   x1.4  tr     1 0.59592531 0.35556869 0.89983245 0.72896456
5   x1.5 cnt     1 0.35804998 0.40584997 0.42263761 0.24988047
6   x1.6 cnt     1 0.42880942 0.70664691 0.74774647 0.16118328
7   x1.7 cnt     1 0.05190332 0.83828767 0.82265258 0.01704265
8   x1.8 cnt     1 0.26417767 0.23958913 0.95465365 0.48610035
9   x1.9  tr     1 0.39879073 0.77077153 0.68544451 0.10290017
10 x1.10 cnt     1 0.83613414 0.35589774 0.50050323 0.80154700
11  x1.1  tr     2 0.35432806 0.03188816 0.07557029 0.86138244
12  x1.2 cnt     2 0.93643254 0.11446759 0.53442678 0.46439198
13  x1.3 cnt     2 0.24586639 0.46893548 0.64135658 0.22286743
14  x1.4  tr     2 0.47314146 0.39698674 0.52573932 0.62354960
15  x1.5 cnt     2 0.19156087 0.83361919 0.03928139 0.20364770
16  x1.6 cnt     2 0.58322197 0.76112174 0.54585984 0.01967341
17  x1.7 cnt     2 0.45947319 0.57335645 0.37276310 0.79799301
18  x1.8 cnt     2 0.46743405 0.44750805 0.96130241 0.27431890
19  x1.9  tr     2 0.39983256 0.08380201 0.25734157 0.16660910
20 x1.10 cnt     2 0.50528560 0.21913855 0.20795168 0.17015172

See below for how to stack and double stack with:
1. reshape (base function)
2. reshape (the package)
3. rbinding and cbinding

#The Data Frame
id <- paste('x', "1.", 1:10, sep="")
set.seed(10)
DF <- data.frame(id, trt=sample(c('cnt', 'tr'), 10, T),
    work.T1=runif(10), play.T1=runif(10),
    talk.T1=runif(10), total.T1=runif(10),
    work.T2=runif(10), play.T2=runif(10),
    talk.T2=runif(10), total.T2=runif(10))


Single Stack
2 Methods using reshape from base:
#Method 1
NEW <- reshape(DF,
    varying=list(
        work  = c(3, 7),
        play  = c(4, 8),
        talk  = c(5, 9),
        total = c(6, 10)),
    v.names=c("work", "play", "talk", "total"), # needed after changing 'varying' to a list to allow 'times'
    direction="long",
    times=1:2,       # substitutes a number for T1 and T2
    timevar="times") # to name the time col
rownames(NEW) <- 1:nrow(NEW)

#Method 2 (shorter but less explicit)
NEW <- reshape(DF, direction="long", varying=3:10, sep=".T")
rownames(NEW) <- 1:nrow(NEW)
NEW

The times=1:2 argument in Method 1 and sep=".T" in Method 2 are doing the same thing: turning T1 & T2 into 1 & 2.

Method from reshape package:


library(reshape)
DF2 <- melt(DF, id.vars=1:2)
DF3 <- cbind(DF2, colsplit(as.character(DF2$variable), "\\.",
    names=c("activity", "times")))

## rename times, reorder factors:
DF4 <- transform(DF3,
    times=as.numeric(gsub("^T", "", times)),
    activity=factor(activity, levels=c("work", "play", "talk", "total")),
    id=factor(id, levels=paste("x1", 1:10, sep=".")))

## reshape back to wide
DF5 <- cast(subset(DF4, select=-variable), id + trt + times ~ activity)

## reorder
NEW <- with(DF5, DF5[order(times, id), ])
NEW

2 Methods using rbinding and cbinding:


#Method 1
DF.1 <- DF[, 1:2]
DFlist <- list(DF[, 3:6], DF[, 7:10])
lapply(seq_along(DFlist), function(x)
    names(DFlist[[x]]) <<-
        unlist(strsplit(names(DFlist[[x]])[1:length(names(DFlist[[x]]))],
            ".", fixed=T))[c(T, F)]
)
repeats <- 2 #Number of repeated measures
time <- rep(1:repeats, each=nrow(DF.1))
NEW <- data.frame(DF.1[rep(seq_len(nrow(DF.1)), repeats), ], time,
    do.call('rbind', DFlist))
NEW

#Method 2
DF.1 <- DF[, 1:2]
DF.2 <- DF[, 3:6]
DF.3 <- DF[, 7:10]

repeats <- 2 #Number of repeated measures
names(DF.2) <- names(DF.3) <- unlist(strsplit(names(DF.2), ".",
    fixed=T))[c(T, F)]
time <- rep(1:repeats, each=nrow(DF.1))
NEW <- data.frame(DF.1[rep(seq_len(nrow(DF.1)), repeats), ], time,
    rbind(DF.2, DF.3))
NEW

Replicate and stack a subset (columns) of a data frame (repeat rows)


This is a method of stacking the same data frame x number of times (replicating its rows, id variables included):

dataframe[rep(seq_len(nrow(dataframe)), repeats), ]


Where:
dataframe - is the data frame to be repeated and stacked
repeats   - is the number of times to repeat the dataframe
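A tiny made-up example of the idiom:

```r
# A two-row data frame, stacked 3 times
dataframe <- data.frame(id = c("a", "b"), score = c(1, 2))
repeats <- 3

# rep(seq_len(nrow(dataframe)), repeats) yields 1 2 1 2 1 2,
# so indexing with it stacks the whole data frame 'repeats' times
stacked <- dataframe[rep(seq_len(nrow(dataframe)), repeats), ]
nrow(stacked)  # 6
```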

Double Stack

(This can be useful for certain analysis or graphics such as repeated measures or faceting in ggplot2)

Method using reshape from base:


NEW2 <- reshape(NEW, direction = "long",
    idvar = c("id", "trt", "time"),
    varying = list(c("work", "play", "talk", "total")),
    v.names = c("measures"),
    timevar = "type",
    times = c("work", "play", "talk", "total"))
rownames(NEW2) <- 1:nrow(NEW2)
NEW2

Method from reshape package:


require(reshape)
DF2 <- melt(DF, id.vars=1:2)
DF3 <- cbind(DF2, colsplit(as.character(DF2$variable), "\\.",
    names=c("type", "times")))
NEW2 <- with(DF3, DF3[, c('id', 'trt', 'times', 'type', 'value')])
levels(NEW2$times) <- 1:2
NEW2



Another Wide to Long With Awkwardly Named Columns (rename 'em for ease)
#THE DATA SET
dat <- read.table(text="
WorkerId pio_1_1 pio_1_2 pio_1_3 pio_1_4 pio_2_1 pio_2_2 pio_2_3 pio_2_4
1 1 Yes No No No No No Yes No
2 2 No Yes No No Yes No Yes No
3 3 Yes Yes No No Yes No Yes No", header=T)

redat <- dat #To reset the data

The trick to getting the most out of reshape is to get your column names in an R-friendly format to begin with; otherwise you have to specify to varying which columns to stack on which.
#METHOD 1 (Cool renaming; if you rename, varying is easy)
#The "([a-z])_([0-9])_([0-9])" part says look for a character group, then "_"
#followed by a digit string, then "_" followed by a digit string. The
#"\\1_\\3\\.\\2" means keep the first character string and "_" right in the
#first spot. Then take the last digit string (#3) and make it second, then put
#a period and take the 2nd digit string and put it 3rd.
names(dat) <- gsub("([a-z])_([0-9])_([0-9])", "\\1_\\3\\.\\2", names(dat))
#names(dat) <- gsub("([0-9])_([0-9])$", "\\2\\.\\1", names(dat)) # another way
dat2 <- reshape(dat, direction="long", varying=2:9, timevar="set", idvar=1)
row.names(dat2) <- NULL
dat2[order(dat2$WorkerId), ]

#METHOD 2 (My Method; if you rename, varying is easy)
y <- do.call('rbind', strsplit(names(dat)[-1], "_"))[, c(1, 3, 2)]
names(dat) <- c(names(dat)[1], paste0(y[, 1], "_", y[, 2], ".", y[, 3]))
dat2 <- reshape(dat, varying=2:9, idvar = "WorkerId", direction="long",
    timevar="set")
row.names(dat2) <- NULL
dat2[order(dat2$WorkerId, dat2$set), ]

#METHOD 3 (Using reshape2)
dat <- redat
library("reshape2")
reshape.middle <- function(dat) {
    dat <- melt(dat, id="WorkerId")
    dat$set <- substr(dat$variable, 5, 5)
    dat$name <- paste(substr(dat$variable, 1, 4),
        substr(dat$variable, 7, 7), sep="")
    dat$variable <- NULL
    dat <- melt(dat, id=c("WorkerId", "set", "name"))
    dat$variable <- NULL
    return(dcast(dat, WorkerId + set ~ name))
}
reshape.middle(dat)

#Without the rename you'd have to approach it this way
dat2 <- reshape(dat, varying=list(
        pio_1 = c(2, 6),
        pio_2 = c(3, 7),
        pio_3 = c(4, 8),
        pio_4 = c(5, 9)),
    v.names=c(paste0("pio_", 1:4)),
    idvar = "WorkerId", direction="long", timevar="set")
row.names(dat2) <- NULL
dat2[order(dat2$WorkerId, dat2$set), ]


Randomish Rows Long Format to Wide w/ Missing Data


var <- c("Id", "Name", "Score", "Id", "Score", "Id", "Name")
num <- c(1, "Tom", 4, 2, 7, 3, "Jim")
format1 <- data.frame(var, num)
format1

#STARTING DATAFRAME
#
#     var num
# 1    Id   1
# 2  Name Tom
# 3 Score   4
# 4    Id   2
# 5 Score   7
# 6    Id   3
# 7  Name Jim

format1$ID <- cumsum(format1$var == "Id")
#ADD THE cumsum ID COLUMN (IMPORTANT FOR BOTH METHODS)
#
#     var num ID
# 1    Id   1  1
# 2  Name Tom  1
# 3 Score   4  1
# 4    Id   2  2
# 5 Score   7  2
# 6    Id   3  3
# 7  Name Jim  3

# METHOD 1
format2 <- reshape(format1, idvar = "ID", timevar = "var",
    direction = "wide")[-1]
names(format2) <- gsub("num.", "", names(format2))
format2

#OUTCOME
#
#   Id Name Score
# 1  1  Tom     4
# 4  2 <NA>     7
# 6  3  Jim  <NA>

# METHOD 2
reshape(format1, idvar = "ID", timevar = "var", direction = "wide",
    varying = list(c("Id", "Name", "Score")))[-1]

# METHOD 3
format1$pk <- cumsum(format1$var == "Id")
library(reshape2)
dcast(format1, pk ~ var, value.var="num")


Extract object names from list in a function (using both lapply and a for loop)
x <- c("yes", "no", "maybe", "no", "no", "yes")
y <- c("red", "blue", "green", "green", "orange")
list.xy <- list(x=x, y=y)

WORD.C <- function(WORDS){
    require(wordcloud)
    L2 <- lapply(WORDS, function(x)
        as.data.frame(table(x), stringsAsFactors = FALSE))

    # Takes a dataframe and the text you want to display
    FUN <- function(X, text){
        windows()
        wordcloud(X[, 1], X[, 2], min.freq=1)
        mtext(text, 3, padj=-4.5, col="red")
    }

    # Now creates the sequence 1,...,length(L2),
    # loops over that, and then creates an anonymous function
    # to send in the information you want to use.
    lapply(seq_along(L2), function(i){FUN(L2[[i]], names(L2)[i])})
}

WORD.C2 <- function(WORDS){
    require(wordcloud)
    L2 <- lapply(WORDS, function(x)
        as.data.frame(table(x), stringsAsFactors = FALSE))

    # Takes a dataframe and the text you want to display
    FUN <- function(X, text){
        windows()
        wordcloud(X[, 1], X[, 2], min.freq=1)
        mtext(text, 3, padj=-4.5, col="red")
    }

    # you could use i in seq_along(L2)
    # instead of 1:length(L2) if you wanted to
    for(i in 1:length(L2)){
        FUN(L2[[i]], names(L2)[i])
    }
}

WORD.C(list.xy)
WORD.C2(list.xy)


Working on Dataframes in Lists & Acting on Global Environment Variables


#CREATE A FAKE DATA SET
df <- data.frame(x.2=rnorm(25), y.2=rnorm(25),
    g=rep(factor(LETTERS[1:5]), 5))

#Strip a Particular Column From Every Data Frame in the List
LIST <- split(df, df$g) #split it into a list of data frames
NAMES <- names(LIST)    #save the names for later use as they may be stripped
LIST <- lapply(seq_along(LIST), function(x) as.data.frame(LIST[[x]])[, 1:2])
LIST

#Change All Variable Names of Data Frames in a List
LIST <- lapply(LIST, function(x) {
    names(x) <- unlist(strsplit(names(x)[1:length(names(x))], ".",
        fixed=T))[c(T, F)]
    return(x)
})
LIST

#Rename All the Data Frames in the List
names(LIST) <- NAMES
LIST

#Assign Data Frames in a List to Objects in the Global Environment
lapply(seq_along(LIST), function(x) {
    assign(c("V", "W", "X", "Y", "Z")[x], LIST[[x]], envir=.GlobalEnv)
})
V; W #etc

#Use Global Assignment to Change All Variable Names of Data Frames in a List
lapply(seq_along(LIST), function(x)
    names(LIST[[x]]) <<-
        unlist(strsplit(names(LIST[[x]])[1:length(names(LIST[[x]]))], ".",
            fixed=T))[c(T, F)]
)
LIST

#Rename All the Data Frames in the List Using Global Assignment
lapply(seq_along(LIST), function(x) {names(LIST)[[x]] <<- NAMES[x]})
LIST


do.call, replicate, split

do.call (take a list, apply a function)


#do.call must be in a list
mtcars2 <- as.list(mtcars)

#do.call with rbind and data.frame
do.call('rbind', mtcars2)
do.call('data.frame', mtcars2)

#to use with paste we have to pass the separator that paste takes
mtcars2$sep <- "HELLO"
do.call('paste', mtcars2)

Classic Use of split, lapply and do.call (split by a factor(s), apply a function, put back together)
Note: consider using by and tapply as well

LIST <- split(mtcars, mtcars$cyl)
MEANS <- lapply(LIST, colMeans)
row2col(do.call('rbind', MEANS), 'cyl')

#notice we split by two factors
LIST2 <- split(mtcars, list(mtcars$cyl, mtcars$carb))
MEANS2 <- lapply(LIST2, colMeans)
OC <- row2col(do.call('rbind', MEANS2), 'cyl.carb')
replacer(OC, NaN, NA)
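A hedged sketch of what the by/tapply versions of the single-factor split might look like (base functions only, no helper functions):

```r
# by: split rows by cyl and apply colMeans to each piece in one call
res.by <- by(mtcars, mtcars$cyl, colMeans)
do.call("rbind", res.by)  # matrix of group means, one row per cyl level

# tapply: one vector at a time, grouped by a factor
tapply(mtcars$mpg, mtcars$cyl, mean)
```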

Use replicate to repeat a function over and over and then do.call('rbind', ) them together
# Create some fake data.
dat <- rnorm(200)

# Get a sample of size 5 from this without replacement
sample(dat, 5)

# Do this 10 times
replicate(10, sample(dat, 5))

#replicate finding means
replicate(10, colMeans(mtcars))

#replicate and rbind a data frame
do.call('rbind', replicate(10, data.frame(a=1:10, b=letters[1:10],
    c=state.name[1:10]), simplify=F))


Data Table Sample Stats


require(data.table)
dat <- data.table(iris)
x <- dat[, list(mean=mean(Sepal.Length), sd=sd(Sepal.Length)), by=Species]
rownamer(x)


Look Up Tables & Dictionaries

#Create a Data Set to Match


test1 <- structure(list(person = structure(1:7, .Label = c("A", "B", "C",
    "D", "E", "F", "G"), class = "factor"), age = c(7L, 22L, 65L, 32L,
    14L, 53L, 23L)), .Names = c("person", "age"), class = "data.frame",
    row.names = c(NA, -7L))

test2 <- structure(list(Lower_limit = c(5L, 15L, 25L, 45L),
    Upper_limt = c(15L, 25L, 45L, 100L), support = c(10L, 20L, 30L, 40L)),
    .Names = c("Lower_limit", "Upper_limt", "support"),
    class = "data.frame", row.names = c(NA, -4L))

test1; test2

test1 is the dataframe to be matched against.

# Merge (slow)
test1$sup1 <- cut(test1$age, breaks=unique(unlist(test2[, 1:2])))
key <- data.frame(sup1=levels(test1$sup1), support=test2$support)
test3 <- merge(test1, key, sort=FALSE)[, -1]
test3 <- test3[order(test3$person), ]
rownames(test3) <- 1:nrow(test3)
test3

# Match
test1$sup1 <- cut(test1$age, breaks=unique(unlist(test2[, 1:2])))
key <- data.frame(sup1=levels(test1$sup1), support=test2$support)
test1$support <- key[match(test1$sup1, key$sup1), 2]
test1[, -3]

test2 is the dictionary/look-up table.

# Hash Table
test1$sup1 <- cut(test1$age, breaks=unique(unlist(test2[, 1:2])))
key <- data.frame(sup1=levels(test1$sup1), support=test2$support)

hash <- function(x, type = "character") {
    e <- new.env(hash = TRUE, size = nrow(x), parent = emptyenv())
    char <- function(col) assign(col[1], as.character(col[2]), envir = e)
    num  <- function(col) assign(col[1], as.numeric(col[2]), envir = e)
    FUN <- if(type == "character") char else num
    apply(x, 1, FUN)
    return(e)
}

KEY <- hash(key, type="numeric")
type <- function(x) if(exists(x, env = KEY)) get(x, e = KEY) else NA
test1$support <- sapply(as.character(test1$sup1), type)
test1[, -3]

# Indexing
test1$sup1 <- cut(test1$age, breaks=unique(unlist(test2[, 1:2])))
key2 <- c(test2$support); names(key2) <- levels(test1$sup1) #lookup table
transform(test1, support=key2[sup1])[, -3]

# data.table
library(data.table)
test1$sup1 <- cut(test1$age, breaks=unique(unlist(test2[, 1:2])))
key <- data.frame(sup1=levels(test1$sup1), support=test2$support)
dtKEY <- data.table(key, key="sup1")
test1$support <- dtKEY[J(test1$sup1), ][[2]]
test1[, -3]

# qdap lookup (hash based)


library(qdap)
test1$sup1 <- cut(test1$age, breaks=unique(unlist(test2[, 1:2])))
key <- data.frame(sup1=levels(test1$sup1), support=test2$support)
test1$support <- lookup(cut(test1$age, c(5, 15, 25, 45, 100)), key)


ggplot2 Globally Alter background color


theme_new <- theme_update(panel.background = element_rect(fill="gray20"))
new <- theme_set(theme_new)
theme_set(new)
ggplot(data = mtcars, aes(mpg, wt)) + geom_point(colour="yellow")

theme_new <- theme_update(panel.background = element_rect(fill="red"))
new <- theme_set(theme_new)
theme_set(new)
ggplot(data = mtcars, aes(mpg, wt)) + geom_point(colour="yellow")

theme_set(theme_gray())
ggplot(data = mtcars, aes(mpg, wt)) + geom_point(colour="yellow") + theme_new  #not a global change

Globally Reset Background Color
theme_set(theme_gray())
theme_set(theme_bw())

White Background, Gray Grid
+ theme(panel.grid.major = element_blank())

White Background, No Grid
+ theme_bw() + theme(panel.grid.major=element_blank(), panel.grid.minor=element_blank())
library(ggplot2) x <- ggplot(CO2, aes(x=uptake, group=Plant)) y <- x + geom_density(aes(colour=Plant)) + facet_grid(Type~Treatment) y + theme_bw() + theme(panel.grid.major=element_blank(),panel.grid.minor=element_blank())

Change Background Color
+ theme(panel.background = element_rect(fill='green', colour='red'))

Change Margin Color
+ theme(plot.background = element_rect(fill='green', colour='red'))


Change the ggplot2 Palette and Apply to Individuals + scale_colour_identity() First create a color palette using hexadecimal color codes. Then assign those colors to groups or observations and add the parameter:
+ scale_colour_identity()
col <- c("#000000", "#FF0000", "#003300")
mtcars$col <- col[1]
mtcars$col[5:6] <- col[2:3]
p <- ggplot(mtcars, aes(x=wt, y=mpg, label=rownames(mtcars)))
p + geom_text(data=mtcars, aes(colour=col), size=2) + scale_colour_identity()

Change Color Saturation (chroma) & Luminance scale_fill_hue(h = c(0, 360) + 15, l = 65, c = 100)
df <- data.frame(cond = c("A", "B", "C"), yval = c(2, 3, 5))  #example data
ggplot(df, aes(x=cond, y=yval, fill=cond)) + geom_bar(stat="identity") + scale_fill_hue(c=15, l=10)
ggplot(df, aes(x=cond, y=yval, fill=cond)) + geom_bar(stat="identity") + scale_fill_hue(c=85, l=10)
ggplot(df, aes(x=cond, y=yval, fill=cond)) + geom_bar(stat="identity") + scale_fill_hue(c=85, l=90)


Map a Numeric Variable onto a Color Continuum scale_colour_gradient(low='col1', high='col2')


#Examples
gradient_rb <- scale_colour_gradient(low='blue', high='red')
p <- ggplot(mtcars, aes(x=wt, y=mpg, label=rownames(mtcars)))
p + geom_text(data=mtcars, aes(colour=mpg), size=3) + gradient_rb
#example 2
p + geom_point(data=mtcars, aes(colour=mpg), size=2) + gradient_rb

Change Color of Factors and Reorder Legend + scale_colour_manual(values = cols)


#Examples
mtcars$cyl <- factor(mtcars$cyl)  #scale_colour_manual needs a discrete variable
p <- ggplot(mtcars, aes(x=wt, y=mpg, label=rownames(mtcars)))
w <- p + geom_point(data=mtcars, aes(colour=cyl), size=3)
w
w + scale_colour_manual(values = c("red","blue", "green"))
w + scale_colour_manual(  #specify who takes what color
    values = c("8" = "red","4" = "blue","6" = "green"))
cols <- c("8" = "red","4" = "blue","6" = "darkgreen", "10" = "orange")
w + scale_colour_manual(values = cols)
#breaks allows you to specify which factor gets what color
w + scale_colour_manual(values = cols, breaks = c("4", "6", "8"))
w + scale_colour_manual(values = cols, breaks = c("8", "6", "4"))
w + scale_colour_manual(values = cols, breaks = c("4", "6", "8"),
    labels = c("four", "six", "eight"))
#plot just some of the groups (below 6 cyl not plotted)
w + scale_colour_manual(values = cols, limits = c("4", "8"))


Change Color Palette + scale_fill_manual()

for bar graphs (items you fill in)

#EXAMPLE
library(ggplot2)
cbbFillPalette <- scale_fill_manual(values=c("#000000", "#E69F00", "#56B4E9"))
cbbFillPalette2 <- scale_fill_manual(values=c("red", "blue", "brown"))
mtcars$cyl <- as.factor(mtcars$cyl)  #make cylinder a factor
ggplot(mtcars, aes(x=cyl, fill=cyl)) + geom_bar() + cbbFillPalette
ggplot(mtcars, aes(x=cyl, fill=cyl)) + geom_bar() + cbbFillPalette2

Adjust Transparency (An argument to many geoms) alpha=


#EXAMPLES
library(ggplot2)
#EX1
x <- ggplot(mtcars, aes(factor(cyl)))
x + geom_bar(fill = "dark grey", colour = "black", alpha = 1/3)
#EX2
df <- data.frame(x = rnorm(5000), y = rnorm(5000))
h <- ggplot(df, aes(x,y))
h + geom_point(alpha = 0.5)
h + geom_point(alpha = 1/10)

Symbols and Color Fills symbols 21:25 are fillable (they take both colour and fill)


x <- ggplot(mtcars, aes(x=hp, y=mpg))
x + geom_point(shape = 21, size = 4, colour = "red", fill = "black")  #set a fixed shape outside aes()

df2 <- data.frame(x = 1:5, y = 1:25, z = 1:25)
s <- ggplot(df2, aes(x = x, y = y))
s + geom_point(aes(shape = z), size = 4, colour = "red", fill = "black") + scale_shape_identity()
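A quick way to preview just the fillable shapes (a small sketch not in the original; shapes 21–25 accept both colour and fill):

```r
library(ggplot2)
d <- data.frame(x = 21:25)
p <- ggplot(d, aes(x = x, y = 1)) +
    geom_point(aes(shape = x), size = 6, colour = "red", fill = "black") +
    scale_shape_identity()  # use the numeric codes as the actual shapes
p
```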


Fill By 2 or More Combined Variables


library(ggplot2); library(RColorBrewer); library(qdap)  #qdap supplies paste2
dat <- data.frame(category = c("A","A","B","B","C","C","D","D"),
    variable = c("inclusion","exclusion","inclusion","exclusion",
        "inclusion","exclusion","inclusion","exclusion"),
    value = c(60,20,20,80,50,55,25,20))
#FILL BY 1 VARIABLE
colors <- c("#FF0000","#990000")
ggplot(dat, aes(category, value, fill = variable)) + geom_bar() +
    scale_fill_manual(values = colors)
#FILL BY 2 VARIABLES
dat$grp <- paste2(dat[, 1:2], sep=" ")  #create a combined variable
ggplot(dat, aes(category, value, fill = grp)) + geom_bar() +
    scale_fill_manual(values = brewer.pal(8,"Reds"))


Annotations and Text Correct Approach to Plotting Annotations (not found in original dataframe)
Create a separate data frame with the text and locations and pass that data frame to geom_text
#Original data frame data2 <- read.table(text= "type value time year 1 NA* 0.90 3 2008 3 EDS 0.01 3 2008 4 KIU 0.01 3 2008 5 MVH 0.09 3 2008 6 LAK 0.00 3 2008 7 NA* 0.80 6 2007 9 EDS 0.05 6 2007 10 KIU 0.00 6 2007 11 MVH 0.15 6 2007 12 LAK 0.00 6 2007 13 NA* 0.41 15 2007 15 EDS 0.04 15 2007 16 KIU 0.03 15 2007 17 MVH 0.52 15 2007 18 LAK 0.00 15 2007 19 NA* 0.23 27 2006 21 EDS 0.11 27 2006 22 KIU 0.02 27 2006 23 MVH 0.64 27 2006 24 LAK 0.01 27 2006", header=T) #create separate text data frame data2.labels <- data.frame( time = c(7, 15), value = c(.9, .6), label = c("correct color", "another correct color!"), type = c("NA*", "MVH") ) ggplot(data2, aes(x=time, y=value, group=type, col=type))+ geom_line()+ geom_point()+ theme_bw() + #pass the new data frame to geom_text so it doesn't print 1000x geom_text(data = data2.labels, aes(x = time, y = value, label = label))

Grid Letters and Greek Text Separate the words with the tilde (~) symbol.
d <- data.frame(x=1:3,y=1:3) qplot(x, y, data=d) + geom_text(aes(2, 2, label="rho~and~some~other~text"), parse=TRUE)
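Beyond geom_text with parse=TRUE, the same plotmath notation can be passed to titles and axis labels via expression() — a minimal sketch assuming only ggplot2:

```r
library(ggplot2)
d <- data.frame(x = 1:3, y = 1:3)
p <- ggplot(d, aes(x, y)) + geom_point() +
    xlab(expression(hat(beta)[0])) +            # plotmath in an axis label
    ylab(expression(sigma^2)) +
    ggtitle(expression(y == alpha + beta * x))  # and in the title
p
```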


Adjust Size Difference Ratio + scale_size(range = c(x, y)) + scale_size_continuous(range = c(x, y))
p <- ggplot(mtcars, aes(hp, as.factor(cyl))) + geom_point(aes(size=mpg)) p p + scale_size(range = c(2, 10)) p + scale_size_continuous(range = c(3,8)) p + scale_size_continuous(range = c(.05,15))

Change Aspect Ratio Of the Plot Region + coord_equal(ratio = 5) qplot(mpg, wt, data = mtcars) + coord_equal(ratio = 5) qplot(mpg, wt, data = mtcars) + coord_equal(ratio = 1) qplot(mpg, wt, data = mtcars) + coord_equal(ratio = 1/5)


Add a title


+ ggtitle("Title text") + labs(title="Title text")
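A minimal demonstration that the two forms are interchangeable:

```r
library(ggplot2)
p1 <- ggplot(mtcars, aes(mpg, wt)) + geom_point() + ggtitle("Title text")
p2 <- ggplot(mtcars, aes(mpg, wt)) + geom_point() + labs(title = "Title text")
p1; p2  # the same title either way
```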


Legends Legend Manipulation + guides()


library(reshape2) # for melt df <- melt(outer(1:4, 1:4), varnames = c("X1", "X2")) p1 <- ggplot(df, aes(X1, X2)) + geom_tile(aes(fill = value)) # Basic form p1 + scale_fill_continuous(guide = "legend") p1 + scale_fill_continuous(guide = guide_legend()) # Guide title p1 + scale_fill_continuous(guide = guide_legend(title = "V")) # title text p1 + scale_fill_continuous(name = "V") # same p1 + scale_fill_continuous(guide = guide_legend(title = NULL)) # no title # Control styles # key size p1 + guides(fill = guide_legend(keywidth = 3, keyheight = 1)) # title position p1 + guides(fill = guide_legend(title = "LEFT", title.position = "left")) # title text styles via theme_text p1 + guides(fill = guide_legend( title.theme = theme_text(size=15, face="italic", col="red", angle=45))) p1 + guides(fill = guide_legend(label.position = "bottom")) # label styles p1 + scale_fill_continuous(breaks = c(5, 10, 15), labels = paste("long", c(5, 10, 15)), guide = guide_legend(direction = "horizontal", title.position = "top", label.position="bottom", label.hjust = 0.5, label.vjust = 0.5, label.theme = theme_text(angle = 90))) # Set aesthetic of legend key # very low alpha value make it difficult to see legend key p3 <- qplot(carat, price, data = diamonds, colour = color, alpha = I(1/100)) p3 # override.aes overwrites the alpha p3 + guides(colour = guide_legend(override.aes = list(alpha = 1))) # multiple row/col legends p <- qplot(1:20, 1:20, colour = letters[1:20]) p + guides(col = guide_legend(nrow = 8)) p + guides(col = guide_legend(ncol = 8)) p + guides(col = guide_legend(nrow = 8, byrow = TRUE)) p + guides(col = guide_legend(ncol = 8, byrow = TRUE)) # reversed order legend p + guides(col = guide_legend(reverse = TRUE))


Change Legend Title + labs(shape=, colour=, fill=, linetype=, etc.)


library(ggplot2) data(iris) ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width)) + geom_point(aes(shape=Species, colour=Petal.Width)) + scale_colour_gradient() + labs(shape="Species label", colour="Petal width label")

Change Legend Position + theme(legend.position = 'left') #directional input + theme(legend.position = c(0.5, 0.5)) #coordinate input
library(ggplot2) xy <- data.frame(x=1:10, y=10:1, type = rep(LETTERS[1:2], each=5)) plot <- ggplot(data = xy)+ geom_point(aes(x = x, y = y, color=type)) plot plot + theme(legend.position = 'left') plot + theme(legend.position = 'bottom') plot + theme(legend.position = c(0.5, 0.5)) plot + theme(legend.position = c(0.9, 0.9))

Eliminate Legend + theme(legend.position = "none")
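For example (a small sketch):

```r
library(ggplot2)
p <- ggplot(mtcars, aes(wt, mpg, colour = factor(cyl))) + geom_point()
p                                    # legend drawn on the right
p + theme(legend.position = "none")  # legend removed entirely
```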


Share Legend
p1 <- ggplot(subset(mtcars, cyl == 4), aes(wt, cyl, colour = mpg)) + geom_point()
p2 <- ggplot(subset(mtcars, cyl == 8), aes(wt, hp, colour = mpg)) + geom_point() +
    guides(colour=FALSE)
library(gridExtra)
grid.draw(cbind(ggplotGrob(p2), ggplotGrob(p1), size="last"))

## Make a tableGrob of your legend
tmp <- ggplot_gtable(ggplot_build(p2))
leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
legend <- tmp$grobs[[leg]]
# Plot objects using widths, heights, and respect to fix aspect ratios
# We make a grid layout with 3 columns, one each for the plots and one for the legend
grid.newpage()
pushViewport(viewport(layout = grid.layout(1, 3,
    widths = unit(c(0.4, 0.4, 0.2), "npc"),
    heights = unit(c(0.45, 0.45, 0.45), "npc"),
    respect = matrix(rep(1, 3), 1))))
print(p1 + theme(legend.position="none"),
    vp = viewport(layout.pos.row = 1, layout.pos.col = 1))
print(p2 + theme(legend.position="none"),
    vp = viewport(layout.pos.row = 1, layout.pos.col = 2))
upViewport(0)
vp3 <- viewport(width = unit(0.2, "npc"), x = 0.9, y = 0.5)
pushViewport(vp3)
grid.draw(legend)
popViewport()


Continuous Legend + guides(fill = guide_colorbar())


library(reshape2) # for melt df <- melt(outer(1:4, 1:4), varnames = c("X1", "X2")) p1 <- ggplot(df, aes(X1, X2)) + geom_tile(aes(fill = value)) p1 + guides(fill = guide_colorbar(barwidth = 0.5, barheight = 10)) p1 + guides(fill = guide_colorbar(label = FALSE)) p1 + guides(fill = guide_colorbar(ticks = FALSE)) p1 + guides(fill = guide_colorbar(label.position = "left")) p1 + guides(fill = guide_colorbar(label.theme = theme_text(col="blue"))) p1 + scale_fill_continuous(limits = c(0,20), breaks=c(0, 5, 10, 15, 20), guide = guide_colorbar(nbin=100, draw.ulim = FALSE, draw.llim = FALSE)) p1 + guides(fill = guide_colorbar(direction = "horizontal", label.theme = theme_text(col="blue")))

Reverse Order Legend + guides(fill = guide_legend(reverse = TRUE))

#EXAMPLE library(ggplot2) p <- ggplot(diamonds, aes(clarity, fill=cut)) + geom_bar() p p + guides(fill = guide_legend(reverse = TRUE))

Change Legend Symbols library(grid) grid.gedit("^key-[-0-9]+$", label = "NEW_SYMBOL")


#EXAMPLE df <- expand.grid(x = factor(seq(1:5)), y = factor(seq(1:5)), KEEP.OUT.ATTRS = FALSE) df$Count <- seq(1:25) # A plot library(ggplot2) p <- ggplot(data = df, aes( x = x, y = y, label = Count, size = Count)) + geom_text() + scale_size(range = c(2, 10)) p library(grid) grid.gedit("^key-[-0-9]+$", label = ":)")


Custom Legend
library(ggplot2) df <- data.frame(gp = factor(rep(letters[1:3], each = 10)), y = rnorm(30)) library(plyr) ds <- ddply(df, .(gp), summarise, mean = mean(y), sd = sd(y)) ggplot(df, aes(x = gp, y = y)) + geom_point(aes(colour="data")) + geom_point(data = ds, aes(y = mean, colour = "mean"), size = 3) + scale_colour_manual("Legend", values=c("mean"="red", "data"="black")) library(reshape2) # in long format dsl <- melt(ds, value.name = 'y') # add variable column to df data.frame df[['variable']] <- 'data' # combine all_data <- rbind(df,dsl) # drop sd rows

data_w_mean <- subset(all_data,variable != 'sd',drop = T) # create vectors for use with scale_..._manual colour_scales <- setNames(c('black','red'),c('data','mean')) size_scales <- setNames(c(1,3),c('data','mean') ) ggplot(data_w_mean, aes(x = gp, y = y)) + geom_point(aes(colour = variable, size = variable)) + scale_colour_manual(name = 'Type', values = colour_scales) + scale_size_manual(name = 'Type', values = size_scales) dsl_mean <- subset(dsl,variable != 'sd',drop = T) ggplot(df, aes(x = gp, y = y, colour = variable, size = variable)) + geom_point() + geom_point(data = dsl_mean) + scale_colour_manual(name = 'Type', values = colour_scales) + scale_size_manual(name = 'Type', values = size_scales)

Remove Diagonal Lines show_guide=FALSE


ggplot(mtcars, aes(factor(cyl), fill=am, group=am)) + geom_bar(colour="black") ggplot(mtcars, aes(factor(cyl), fill=am, group=am)) + geom_bar() + geom_bar(colour="black", show_guide=FALSE)


Eliminate Vertical/Horizontal Grid Lines


#MAKE A DATA SET library(ggplot2); set.seed(10) CO3 <- data.frame(id=1:nrow(CO2), CO2[, 2:3], outcome=factor(sample(c('none', 'some', 'lots', 'tons'), nrow(CO2), rep=T), levels=c('none', 'some', 'lots', 'tons'))) x <- ggplot(CO3, aes(x=outcome)) + geom_bar(aes(x=outcome))+ facet_grid(Treatment~Type, margins='Treatment', scales='free') + theme_bw() + theme(axis.text.x=element_text(angle= 45, vjust=1, hjust= 1)) #REMOVE LINES x + theme(panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank()) x + theme(panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank())

Equal distance between bars


df <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L), .Label = c("1", "2", "3", "4", "5", "6", "7"), class = "factor"), TYPE = structure(c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L), .Label = c("1", "2", "3", "4", "5", "6", "7", "8"), class = "factor"), TIME = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L), .Label = c("1", "5", "15"), class = "factor"), VAL = c(0.94, 0.52, 0.28, 0.97, 0.12, 0.05, 0.47, 0.62, 0.2, 0.73, 1, 0.98, 0.67, 0.29, 0.17, 0.86, 0.17, 0.83, 0.62, 0.79, 0.76, 0.43, 0.61, 0.18, 0.53, 0.49, 0.47, 0.07, 0.7, 0.23, 0.36, 0.52, 0.26, 0.15, 0.01, 0.46, 0.92, 0.23), w = c(0.675, 0.675, 0.675, 0.675, 0.675, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.675, 0.675, 0.675, 0.675, 0.675, 0.675, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.675, 0.675, 0.675, 0.675, 0.675, 0.675, 0.9, 0.9, 0.9)), .Names = c("ID", "TYPE", "TIME", "VAL", "w"), row.names = c(NA, -38L), class = "data.frame")

ggplot(df, aes(x=ID, y=VAL, fill=TYPE)) + facet_wrap(~TIME, ncol=1) + geom_bar(position="stack",stat = "identity") + coord_flip() ggplot(df, aes(x=ID, y=VAL, fill=TYPE)) + facet_wrap(~TIME, ncol=1, scale="free") + geom_bar(position="stack",stat = "identity") + coord_flip() df$w <- 0.9 df$w[df$TIME == 5] <- 0.9 * 3/4 ggplot(df, aes(x=ID, y=VAL, fill=TYPE)) + facet_wrap(~TIME, ncol=1, scale="free") + geom_bar(position="stack",aes(width = w),stat = "identity") + coord_flip()


Faceting Faceted Plot


library(ggplot2) qplot(mpg, wt, data=mtcars) + facet_grid(cyl ~ vs)

Faceted Plot Margins (including plotting just one margin)


library(ggplot2) qplot(mpg, wt, data=mtcars) + facet_grid(cyl ~ vs, margins=TRUE) qplot(mpg, wt, data=mtcars) + facet_grid(cyl ~ vs, margins='vs') qplot(mpg, wt, data=mtcars) + facet_grid(cyl ~ vs, margins='cyl')


Facet Labels on Top library(ggplot2) + facet_wrap(~Species, ncol=1) + facet_wrap(~Species,nrow = 3)


library(ggplot2)
ggplot(iris, aes(Petal.Length)) + stat_bin() + facet_grid(Species ~ .)
ggplot(iris, aes(Petal.Length)) + stat_bin() + facet_wrap(~Species, nrow = 3)
ggplot(iris, aes(Petal.Length)) + stat_bin() + facet_wrap(~Species, ncol = 1)
ggplot(iris, aes(Petal.Length)) + stat_bin() + facet_wrap(~Species, nrow = 4)

Change Facet Labels #Data Source and plot


library(ggplot2); library(directlabels) x <- ggplot(CO2, aes(x=uptake, group=Plant)) y <- x + geom_density(aes(colour=Plant)) y + facet_grid(Type~Treatment)

#method 1 Does not alter data


mf_labeller <- function(var, value){ value <- as.character(value) if (var=="Treatment") { value[value=="nonchilled"] <- "Var 1" value[value=="chilled"] <- "Var 2" } return(value) } y + facet_grid(Type~Treatment, labeller=mf_labeller)

#method 2 Faster but does alter data


levels(CO2$Treatment) <- c("Var 1", "Var 2") library(ggplot2); library(directlabels) x <- ggplot(CO2, aes(x=uptake, group=Plant)) y <- x + geom_density(aes(colour=Plant)) y + facet_grid(Type~Treatment)


Change the "(all)" margin label in facet_grid


label_renamemargin_gen <- function(newname="Total") { function(variable, value) { value <- as.character(value) value[value == "(all)"] <- newname value } } ggplot(mtcars, aes(cyl)) + geom_point(stat="bin", size = 2, aes(shape = gear, position = "stack")) + facet_grid(carb ~ gear, margins = TRUE, labeller=label_renamemargin_gen("Total"))

Adjust Facet Labels and Boxes


library(ggplot2) x <- ggplot(CO2, aes(x=uptake, group=Plant)) x + geom_density(aes(colour=Plant)) + facet_grid(Type~Treatment)+ theme(strip.text.x = element_text(size=8, angle=75), strip.text.y = element_text(size=12, face="bold"), strip.background = element_rect(colour="red", fill="#CCCCFF"))

Eliminate Background Color and Maintain Facet Boxes + theme_bw()


ggplot(CO2, aes(conc)) + geom_density() + facet_grid(Type~Treatment) + theme(panel.background = element_blank())
#basically don't use panel.background for this

ggplot(CO2, aes(conc)) + geom_density() + facet_grid(Type~Treatment) + #theme(panel.background = element_blank()) + theme_bw()


Annotate one box in facet_grid


library(ggplot2) p <- ggplot(mtcars, aes(mpg, wt)) + geom_point() p <- p + facet_grid(. ~ cyl) #create a new data frame with the info ann_text <- data.frame(mpg = 15,wt = 5,lab = "Text", cyl = factor(8,levels = c("4","6","8"))) p + geom_text(data = ann_text,label = "Text")

Annotate every box in facet_grid


#make a few numeric into factors
mtcars[, c("cyl", "am", "gear")] <- lapply(mtcars[, c("cyl", "am", "gear")], as.factor) #plot it with no annotations p <- ggplot(mtcars, aes(mpg, wt, group = cyl)) + geom_line(aes(color=cyl)) + geom_point(aes(shape=cyl)) + facet_grid(gear ~ am) + theme_bw() p #find number of facets len <- length(levels(mtcars$gear)) * length(levels(mtcars$am))

   x y gear am labs
1 15 5    3  0    A
2 15 5    4  0    B
3 15 5    5  0    C
4 15 5    3  1    D
5 15 5    4  1    E
6 15 5    5  1    F

#make a data frame with coordinates, facet variable levels, labels vars <- data.frame(expand.grid(levels(mtcars$gear), levels(mtcars$am))) colnames(vars) <- c("gear", "am") dat <- data.frame(x = rep(15, len), y = rep(5, len), vars, labs=LETTERS[1:len]) #use geom_text to annotate (notice group=NULL) p + geom_text(aes(x, y, label=labs, group=NULL),data=dat) #change just one location dat[1, 1:2] <- c(30, 2) #to change specific locations p + geom_text(aes(x, y, label=labs, group=NULL), data=dat) #use math plotting p + geom_text(aes(x, y, label=paste("beta ==", labs), group=NULL), size = 4, color = "grey50", data=dat, parse = T)


Axis Adjustments Eliminate Space at Bottom of Barplots + scale_y_continuous(expand = c(0,0))


#EXAMPLE qplot(1:10, geom = 'bar') qplot(1:10, geom = 'bar') + scale_y_continuous(expand = c(0,0))

Flip Axes + coord_flip()

#Examples qplot(cut, price, data=diamonds, geom="boxplot") last_plot() + coord_flip() qplot(cut, data=diamonds, geom="bar") last_plot() + coord_flip()
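coord_flip() swaps the x and y axes. To reverse the direction of a single scale instead (an addition to the original examples), ggplot2 also offers scale_x_reverse()/scale_y_reverse():

```r
library(ggplot2)
p <- ggplot(mtcars, aes(mpg, wt)) + geom_point()
p                      # default direction
p + scale_x_reverse()  # x now runs from high to low
p + scale_y_reverse()  # y now runs from high to low
```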


Adjust Axis Labels
+ theme(axis.title.x = element_text(vjust=-0.5))  #vertical
+ theme(axis.title.x = element_text(hjust=0.25))  #horizontal

Axis Label Names
+ labs(x = "x", y = "y")  OR  + xlab("x") + ylab("y")
p <- qplot(mpg, wt, data = mtcars)
p + xlab("Vehicle Weight") + ylab("Miles per Gallon")
# Or
p + labs(x = "Vehicle Weight", y = "Miles per Gallon")


Dendrograms with ggplot2 library(ggplot2); library(ggdendro)


library(ggplot2) library(ggdendro) data(mtcars) x <- as.matrix(scale(mtcars)) dd.row <- as.dendrogram(hclust(dist(t(x)))) ddata_x <- dendro_data(dd.row) p <- ggplot(segment(ddata_x)) + geom_segment(aes(x=x, y=y, xend=xend, yend=yend)) + scale_y_continuous(trans = 'reverse') p + geom_text(data=label(ddata_x), aes(label=label, x=x, y=0), hjust=0) + coord_flip()

Initial Between Variable Data Visualization scatterplot matrix


library(ggplot2) library(GGally) ggpairs(iris, colour='Species', alpha=0.4) ggpairs(CO2, colour ='Type', alpha=0.4) mtcars$cyl <- factor(mtcars$cyl) ggpairs(mtcars, colour ='cyl', alpha=0.4)


Combine Two plots (even facetted plots) library(gridExtra) grid.arrange(plot.1, ..., plot.n)
library(ggplot2)
p1 <- ggplot(mtcars[mtcars$cyl!=8,], aes(mpg, wt))+ geom_point()+ facet_wrap( ~ cyl) p2 <- ggplot(mtcars[mtcars$cyl!=8,], aes(mpg, wt))+ geom_point()+ facet_grid(am ~ cyl)+ theme( axis.text.y = element_blank(), axis.text.x = element_blank(), axis.title.y = element_blank(), axis.ticks = element_blank(), #strip.background = element_blank(), strip.text.x = element_blank()) library(gridExtra) grid.arrange(p1,p2, main ="this is a title", left = "This is my global Y-axis title")

Nice reference to this: http://stackoverflow.com/questions/8112208/how-can-i-obtain-an-unbalanced-grid-of-ggplots


Add a table to a grid plot #1 + annotation_custom(grob, xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf)

Add a table to a grid plot (can't superimpose) library(gridExtra) tableGrob() ?tableGrob

Add Table to Plot (control widths)


my_hist<-ggplot(diamonds, aes(clarity, fill=cut)) + geom_bar() my_table<- tableGrob(head(diamonds)[,1:3], gpar.coretext = gpar(fontsize=8),gpar.coltext=gpar(fontsize=8), gpar.rowtext=gpar(fontsize=8)) grid.arrange(my_hist,my_table, ncol=2) grid.arrange(my_hist,my_table, ncol=2, widths=c(.7, .3))

Add a table right below a legend


my_hist<-ggplot(diamonds, aes(clarity, fill=cut)) + geom_bar() #create inset table my_table<- tableGrob(head(diamonds)[,1:3], gpar.coretext =gpar(fontsize=8), gpar.coltext=gpar(fontsize=8), gpar.rowtext=gpar(fontsize=8)) #Extract Legend g_legend<-function(a.gplot){ tmp <- ggplot_gtable(ggplot_build(a.gplot)) leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box") legend <- tmp$grobs[[leg]] return(legend)} legend <- g_legend(my_hist) #Create the viewports, push them, draw and go up grid.newpage() vp1 <- viewport(width = 0.75, height = 1, x = 0.375, y = .5) vpleg <- viewport(width = 0.25, height = 0.5, x = 0.85, y = 0.75) subvp <- viewport(width = 0.3, height = 0.3, x = 0.85, y = 0.25) print(my_hist + theme(legend.position = "none"), vp = vp1) upViewport(0) pushViewport(vpleg) grid.draw(legend) #Make the new viewport active and draw upViewport(0) pushViewport(subvp) grid.draw(my_table)


Add text to a bar plot


#EXAMPLE 1: Above Bars
library(ggplot2)
mtcars2 <- data.frame(id=1:nrow(mtcars), mtcars[, c(2, 8:11)])
mtcars2[, -1] <- lapply(mtcars2[, -1], as.factor)
with(mtcars2, ftable(cyl, gear, am))  #USE FOR FREQUENCY COUNTS OF ANY VARIABLE
ggplot(mtcars2, aes(x=cyl)) + geom_bar() + facet_grid(gear~am) +
    stat_bin(geom="text", aes(label=..count.., vjust=-1))

#EXAMPLE 2: On Stacked Bar
Year <- rep(c("2006-07", "2007-08", "2008-09", "2009-10"), each = 4)
Category <- rep(c("A", "B", "C", "D"), times = 4)
Frequency <- c(168, 259, 226, 340, 216, 431, 319, 368, 423, 645, 234, 685, 166, 467, 274, 251)
Data <- data.frame(Year, Category, Frequency)
library(ggplot2)
p <- qplot(Year, Frequency, data = Data, geom = "bar", fill = Category, theme_set(theme_bw()))
p + geom_text(aes(label = Frequency), size = 3, hjust = 0.5, vjust = 3, position = "stack")

#EXAMPLE 3: Centered on Stacked Bar
library(plyr)
#compute the midpoint (pos) of each stacked segment for label placement
Data <- ddply(Data, .(Year), transform, pos = cumsum(Frequency) - 0.5*Frequency)
ggplot(Data, aes(x = Year, y = Frequency)) + geom_bar(aes(fill = Category)) +
    geom_text(aes(label = Frequency, y = pos), size = 3)

Add text to Barplots (negative and positive values)


library(plyr);library(ggplot2);library(scales) dtf <- data.frame(x = c("ETB", "PMA", "PER", "KON", "TRA", "DDR", "BUM", "MAT", "HED", "EXP"), y = c(.02, .11, -.01, -.03, -.03, .02, .1, -.01, -.02, 0.06)) ggplot(dtf, aes(x, y)) + geom_bar(stat = "identity", aes(fill = x), legend = FALSE) + geom_text(aes(label = paste(y * 100, "%"), vjust = ifelse(y >= 0, -.2, 1.1))) + scale_y_continuous("Anteil in Prozent", labels = percent_format()) + theme(axis.title.x = element_blank())


stat_summary [ggplot2] Alter boxplot ends


library(ggplot2)
data(mpg)
#Create a function to calculate the points
get_tails <- function(x) {
    q1 <- quantile(x)[2]
    q3 <- quantile(x)[4]
    iqr <- q3 - q1
    upper <- q3 + 1.5*iqr
    lower <- q1 - 1.5*iqr
    ##Trim upper and lower
    up <- max(x[x < upper])
    lo <- min(x[x > lower])
    return(c(lo, up))
}
ggplot(mpg, aes(x=drv, y=hwy)) + geom_boxplot() +
    stat_summary(geom="point", fun.y=get_tails, colour="Red", shape=3, size=5)
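The whisker logic in get_tails() can be checked on its own before mapping it onto a plot; this sketch repeats the function so it runs standalone:

```r
# same fence logic as get_tails() above, repeated for a standalone check
get_tails <- function(x) {
    q1 <- quantile(x)[2]
    q3 <- quantile(x)[4]
    iqr <- q3 - q1
    upper <- q3 + 1.5 * iqr   # upper fence
    lower <- q1 - 1.5 * iqr   # lower fence
    # most extreme observations still inside the fences
    c(min(x[x > lower]), max(x[x < upper]))
}
set.seed(1)
vals <- c(rnorm(50), 10)  # one clear outlier at 10
get_tails(vals)           # the outlier is excluded from the upper whisker end
```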


Add Colored Rectangles in Background

geom_rect()
#EXAMPLE scores <- data.frame(category = 1:4, percentage = c(34,62,41,44), type = c("a","a","a","b")) rects <- data.frame(ystart = c(0,25,45,65,85), yend = c(25,45,65,85,100), #the y values to stop and start coloring col = c("Z1","Z2","Z3","Z4","Z5")) #the "grouping" variable to color on labels <- c("ER", "OP", "PAE", "Overall") #labels for the x axis medals <- c("navy","goldenrod4","darkgrey","gold","cadetblue1") #rectangle colors library(ggplot2) ggplot() + geom_rect(data = rects, aes(xmin = -Inf, xmax = Inf, ymin = ystart, ymax = yend, fill=col), alpha = 0.3) + theme(legend.position="none") + geom_bar(data=scores, aes(x=category, y=percentage, fill=type), stat="identity") + scale_fill_manual(values=c("indianred1", "indianred4", medals)) + scale_x_continuous(breaks = 1:4, labels = labels)


Labels Labels above bars scale_y_continuous(labels = percent)


#DATA SET df <- structure(list(A = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), .Label = c("0-50,000", "50,001-250,000", "250,001-Over"), class = "factor"), B = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("0-50,000", "50,001-250,000", "250,001-Over"), class = "factor"), Freq = c(0.507713884992987, 0.258064516129032, 0.23422159887798, 0.168539325842697, 0.525280898876405, 0.306179775280899, 0.160958904109589, 0.243150684931507, 0.595890410958904)), .Names = c("A", "B", "Freq"), class = "data.frame", row.names = c(NA, -9L)) library(ggplot2); library(scales) ggplot(data=df, aes(x=A, y=Freq))+ geom_bar(aes(fill=B), position = position_dodge()) + geom_text(aes(label = paste(sprintf("%.1f", Freq*100), "%", sep=""), y = Freq+0.015, group=B), size = 3, position = position_dodge(width=0.9)) + scale_y_continuous(labels = percent) + theme_bw()


Mapping Maps for Different Coordinate Systems + coord_map()

require("maps"); require("mapproj")  #mapproj is needed for the project= options
states <- data.frame(map("state", plot=FALSE)[c("x","y")])
(usamap <- qplot(x, y, data=states, geom="path"))
usamap + coord_map()
usamap + coord_map(project="orthographic")
usamap + coord_map(project="stereographic")
usamap + coord_map(project="conic", lat0 = 30)
usamap + coord_map(project="bonne", lat0 = 50)


Random Stacked Bar Histogram


# Create data Set set.seed(3421) library(plyr); library(ggplot2) # added type to mimick which candidate is supported dfr <- data.frame( name = LETTERS[1:26], percent = rnorm(26, mean=15), type = sample(c("A", "B"), 26, replace = TRUE) ) # easier to prepare data in advance. uses two ideas # 1. calculate histogram bins (quite flexible) # 2. calculate frequencies and label positions dfr <- transform(dfr, perc_bin = cut(percent, 5)) dfr <- ddply(dfr, .(perc_bin), mutate, freq = length(name), pos = cumsum(freq) - 0.5*freq) # start plotting. key steps are # 1. plot bars, filled by type and grouped by name # 2. plot labels using name at position pos # 3. get rid of grid, border, background, y axis text and lables ggplot(dfr, aes(x = perc_bin)) + geom_hline(yintercept=seq(10, 70, by=10), colour="gray90", size=.05) + geom_bar(aes(y = freq, group = name, fill = type), colour = 'gray60', show_guide = F) + geom_text(aes(y = pos, label = name), colour = 'white') + scale_fill_manual(values = c('red', 'orange')) + theme_bw() + xlab("") + ylab("") + scale_y_continuous(expand = c(0,0))+ theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

Show Zero Count (discrete/categorical) + scale_x_discrete(drop=F)


#EXAMPLES library(ggplot2) mtcars$cyl<-factor(mtcars$cyl) levels(mtcars[!mtcars$cyl==4,]$cyl) #level 4 there but won't be plotted ggplot(mtcars[!mtcars$cyl==4,], aes(cyl))+ geom_bar() ggplot(mtcars[!mtcars$cyl==4,], aes(cyl))+ geom_bar() + scale_x_discrete(drop=F)

Histogram That Matches Base R's hist()


right = TRUE
#EXAMPLE ggplot(diamonds, aes(carat, ..density..)) + geom_histogram(binwidth = 0.2) + facet_grid(.~cut) ggplot(diamonds, aes(carat, ..density..)) + geom_histogram(binwidth = 0.2, right = TRUE) + facet_grid(.~cut)


Market Share Plot


library(ggplot2)
library(reshape2)
library(scales)
# A DATA SET
Scholar <- structure(list(Year = 1995:2011,
    BDMP = c(18L, 19L, 41L, 30L, 18L, 36L, 28L, 33L, 37L, 45L, 36L, 27L, 27L, 26L, 47L, 43L, 45L),
    JMP = c(257L, 370L, 550L, 690L, 865L, 1060L, 1190L, 1430L, 1710L, 2070L, 2520L, 2970L, 3400L, 3830L, 4170L, 4680L, 5590L),
    Minitab = c(1150L, 1290L, 1400L, 1460L, 1670L, 1890L, 2180L, 2490L, 2860L, 3300L, 3770L, 4590L, 5210L, 5830L, 6510L, 7190L, 7990L),
    SPSS = c(6450L, 7600L, 10500L, 14500L, 24300L, 45600L, 67200L, 87200L, 75900L, 137000L, 145000L, 141000L, 133000L, 119000L, 61500L, 45700L, 33200L),
    SAS = c(8630L, 8700L, 10200L, 11100L, 12700L, 16500L, 21900L, 27200L, 39600L, 49400L, 57000L, 62800L, 60400L, 59100L, 53700L, 43000L, 32300L),
    Stata = c(22L, 91L, 205L, 322L, 516L, 784L, 986L, 1290L, 1740L, 2400L, 3090L, 4010L, 5100L, 6330L, 7600L, 9230L, 12000L),
    Statistica = c(3L, 11L, 19L, 28L, 23L, 42L, 62L, 84L, 89L, 146L, 165L, 219L, 209L, 249L, 297L, 351L, 413L),
    Systat = c(2480L, 2510L, 3390L, 2700L, 2650L, 2780L, 2880L, 2900L, 3100L, 3340L, 4000L, 4870L, 5430L, 6270L, 6560L, 7030L, 8060L),
    R = c(8L, 2L, 6L, 13L, 25L, 51L, 133L, 286L, 627L, 1180L, 2180L, 3430L, 5060L, 6960L, 9150L, 11400L, 14500L),
    SPlus = c(8L, 17L, 33L, 39L, 45L, 52L, 159L, 341L, 574L, 817L, 1010L, 1180L, 1160L, 1180L, 970L, 710L, 644L)),
    .Names = c("Year", "BDMP", "JMP", "Minitab", "SPSS", "SAS", "Stata", "Statistica", "Systat", "R", "SPlus"),
    class = "data.frame", row.names = c(NA, -17L))
Little6 <- c("JMP", "Minitab", "Stata", "Statistica", "Systat", "R")
Subset <- Scholar[, Little6]
Year <- rep(Scholar$Year, length(Subset))
ScholarLong <- melt(Subset)
names(ScholarLong) <- c("Software", "Hits")
ScholarLong <- data.frame(Year, ScholarLong)
ggplot(ScholarLong, aes(Year, Hits, group=Software)) +
    geom_smooth(aes(fill=Software), position="fill") +
    coord_flip() +
    scale_x_continuous("Year", trans="reverse") +
    scale_y_continuous("Proportion of Google Scholar Hits For Each Software", labels = NULL) +
    ggtitle("Market Share") +
    theme(axis.ticks = element_blank())


Dotplot
DF <- structure(list(Country = structure(1:30, .Label = c("Georgia",
    "South Africa", "Colombia", "Cuba", "Poland", "Romania",
    "Taipei (Chinese Taipei)", "Azerbaijan", "Belgium", "Canada",
    "Republic of Moldova", "Norway", "Serbia", "Slovakia", "Ukraine",
    "Uzbekistan", "Kazakhstan", "Netherlands", "Great Britain",
    "Democratic People's Republic of Korea", "Australia", "Brazil",
    "Hungary", "France", "Russian Federation", "Republic of Korea",
    "Japan", "Italy", "United States of America",
    "People's Republic of China"), class = "factor"),
    Gold = c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 2,
        1, 1, 1, 2, 1, 2, 0, 2, 3, 6),
    Silver = c(0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
        0, 1, 1, 1, 1, 0, 1, 2, 3, 5, 4),
    Bronze = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1,
        1, 1, 1, 1, 1, 3, 2, 3, 2, 3, 2),
    total = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,
        3, 3, 3, 3, 4, 4, 5, 5, 7, 11, 12)),
    .Names = c("Country", "Gold", "Silver", "Bronze", "total"),
    row.names = c(13L, 14L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L,
        26L, 27L, 28L, 29L, 30L, 7L, 11L, 16L, 6L, 8L, 9L, 10L, 5L, 12L,
        4L, 15L, 3L, 2L, 1L), class = "data.frame")

# Convert from wide to long format
DF2 <- reshape(DF, varying = 2:5, direction = "long", v.names = "number",
    timevar = "medal", idvar = "Country",
    times = c("Gold", "Silver", "Bronze", "total"))
DF2$medal <- factor(DF2$medal, levels = c("Bronze", "Silver", "Gold", "total"))

# Faceted dotplot, one panel per medal type
ggplot(DF2, aes(x = number, y = Country, colour = medal)) +
    geom_point() +
    facet_grid(. ~ medal) +
    theme_bw() +
    scale_colour_manual(values = c("#CC6600", "#999999", "#FFCC33", "#000000"))


Direct Labels Move Specific Labels Around


# The faceted ggplot code
library(ggplot2); library(directlabels)
x <- ggplot(CO2, aes(x = uptake, group = Plant))
y <- x + geom_density(aes(colour = Plant)) +
    facet_grid(Type ~ Treatment) +
    theme_bw()
y                # with a legend
direct.label(y)  # with direct labels

# Supply arguments to direct.label to move specific labels around
# (the original was missing the "-" in "<-")
my.method1 <- list('top.points',
    dl.move("Qn1", hjust = 0, vjust = -5),
    dl.move("Qc2", hjust = 6, vjust = -8))
direct.label(y, my.method1)  # moved labels

Find Values from Plot and Adjust that Way


library(ggplot2); library(directlabels)
set.seed(124234345)

# Generate data
df.2 <- data.frame("n_gram" = c("word1"), "year" = rep(100:199),
    "match_count" = runif(100, min = 1000, max = 2000))
df.2 <- rbind(df.2, data.frame("n_gram" = c("word2"), "year" = rep(100:199),
    "match_count" = runif(100, min = 1000, max = 2000)))

# Function to get the last fitted Y-value from a loess smooth,
# then place that group's label there with dl.move
funcDlMove <- function(n_gram) {
    model <- loess(match_count ~ year, df.2[df.2$n_gram == n_gram, ],
        span = 0.3)
    Y <- model$fitted[length(model$fitted)]
    Y <- dl.move(n_gram, y = Y, x = 200)
    return(Y)
}

index <- unique(df.2$n_gram)
mymethod <- list("top.points", lapply(index, funcDlMove))

# Plot
PLOT <- ggplot(df.2, aes(year, match_count, group = n_gram,
        color = n_gram)) +
    geom_line(alpha = I(7/10), color = "grey", show_guide = F) +
    stat_smooth(size = 2, span = 0.3, se = F, show_guide = F)
direct.label(PLOT, mymethod)


Move Plot Over to Add Line Names


library(pacman)  # provides p_load
p_load(ggplot2, directlabels)
set.seed(124234345)

# Generate data
df.2 <- data.frame("n_gram" = c("word1"), "year" = rep(100:199),
    "match_count" = runif(100, min = 1000, max = 2000))
df.2 <- rbind(df.2, data.frame("n_gram" = c("word2"), "year" = rep(100:199),
    "match_count" = runif(100, min = 1000, max = 2000)))

# Hand-tuned label positions for each line
mymethod <- list("top.points",
    dl.move("word1", hjust = -.5, vjust = 19.5),
    dl.move("word2", hjust = -4.4, vjust = 15.5))

# xlim extends the x-axis past the data to make room for the labels
ggplot(df.2, aes(year, match_count, group = n_gram, color = n_gram)) +
    geom_line(alpha = I(7/10), color = "grey", show_guide = F) +
    xlim(c(100, 220)) +
    stat_smooth(size = 2, span = 0.3, se = F, show_guide = F) +
    geom_dl(aes(label = n_gram), method = mymethod, show_guide = F)


LaTeX
Prepare tables for LaTeX (requires the xtable package):
library(xtable)
xtable(table, caption = NULL, label = NULL, align = NULL, digits = 3,
    display = NULL)
Info on R to LaTeX:
http://stackoverflow.com/questions/2978784/suggestion-for-r-latex-table-creation-package
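As a minimal sketch of the xtable workflow (the data frame here is purely illustrative, and assumes the xtable package is installed): printing an xtable object writes the LaTeX table environment to the console, ready to paste into a .tex file.

```r
library(xtable)

# Small hypothetical data frame to convert
medals <- data.frame(Country = c("USA", "China"), Gold = c(3, 6))

# Build the xtable; caption and label become \caption{} and \label{}
tab <- xtable(medals, caption = "Medal counts", label = "tab:medals",
    digits = 0)

# Emit the LaTeX code (drop the automatic row names)
print(tab, include.rownames = FALSE)
```

print.xtable() also accepts a file = argument to write the output directly to a .tex file instead of the console.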

