
ORF 245. Fundamentals of Engineering Statistics. Sébastien Bubeck. Spring 2014.

Precept # 1

For this first precept, we will make sure that R is properly installed on your laptop and review the basic manipulations. The goal is not to show all the features of R, but rather to illustrate the features of R that can be learned in a one-semester, introductory statistics course.

As described on the R project web page: R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS. It consists of a language plus a run-time environment with graphics, a debugger, access to certain system functions, and the ability to run programs stored in script files. The core of R is an interpreted computer language which allows branching and looping as well as modular programming using functions. It is currently maintained by the R core-development team, a hard-working, international team of volunteer developers. The R project web page is http://www.r-project.org.

Installation

MacOS: Download the file R-2.15.1-signed.pkg from http://cran.us.r-project.org/bin/macosx. Run R-2.15.1-signed.pkg by double-clicking on it. R will be installed to your Applications folder.

Win: Download the file R-2.15.1-win32.exe from http://cran.us.r-project.org/bin/windows/base. Run R-2.15.1-win32.exe by double-clicking on it. The installation program will create the directory c:\Program Files\R\R-2.15.1, where R-2.15.1 may vary according to the version of R that you have installed. The actual R program will be c:\Program Files\R\R-2.15.1\bin\Rgui.exe. A Windows shortcut should have been created on the desktop and/or in the Start menu.

Unix: Download the file R-2.15.1.tar.gz to a new directory and then execute the following command:

    mkdir -p ~/.R/libs/

To inform R where to look for the libraries that you install, create a file called .Renviron in your home directory containing the line R_LIBS=~/.R/libs/. This can be done using the following command:

    echo R_LIBS=~/.R/libs/ >> ~/.Renviron

This sets the environment variable R_LIBS whenever you start R and adds the path to the list of paths returned by the R command .libPaths(), which can be executed inside R.

R is most easily used in an interactive manner: you ask a question and R gives you an answer. Questions are asked and answered on the command line. To start R's command line: on Mac, double-click the R icon in the Applications folder; on Unix, type R at the command line; on Windows, find the R icon and double-click it. Once R is started, you should be greeted with a message similar to

    R version 2.15.1 (2012-06-22) -- "Roasted Marshmallows"
    Copyright (C) 2012 The R Foundation for Statistical Computing
    ISBN 3-900051-07-0
    Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

    R is free software and comes with ABSOLUTELY NO WARRANTY.
    You are welcome to redistribute it under certain conditions.
    Type 'license()' or 'licence()' for distribution details.

      Natural language support but running in an English locale

    R is a collaborative project with many contributors.
    Type 'contributors()' for more information and
    'citation()' on how to cite R or R packages in publications.

    Type 'demo()' for some demos, 'help()' for on-line help, or
    'help.start()' for an HTML browser interface to help.
    Type 'q()' to quit R.

    [R.app GUI 1.52 (6188) x86_64-apple-darwin9.8.0]

    >

The > is called the prompt. This is where you enter your commands, each followed by pressing the RETURN key. Try these simple examples:

    > 2+2
    [1] 4
    > x=4
    > x
    [1] 4
    > x+2
    [1] 6

Getting Help

R has a built-in help facility similar to the man facility of UNIX. To get more information on any specific named function, for example solve, use either of the commands

    > help(solve)
    > ?solve

On most R installations help is available in HTML format by running help.start(), which launches a Web browser that allows the help pages to be browsed with hyperlinks. On UNIX, subsequent help requests are sent to the HTML-based help system. The Search Engine and Keywords link in the page loaded by help.start() is particularly useful, as it contains a high-level concept list which searches through available functions. On Mac and Windows installations, there is also a search field.
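When the exact name of a function is not known, a few other built-in lookup functions may help. The following is a minimal sketch; the search terms are arbitrary examples chosen for illustration and do not come from the precept.

```r
# apropos() lists objects whose names match a regular expression.
hits <- apropos("^solve")
print(hits)

# args() shows the argument list of a function without opening its help page.
print(args(solve))

# help.search() (shortcut: ??) searches the help pages by keyword rather
# than by exact function name. It is commented out here because it opens
# a results viewer when run interactively.
# help.search("linear equations")

# example(solve) would run the examples given on the help page of solve.
```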

Data creation and extraction

R operates on named data structures. The simplest such structure is the numeric vector, which is a single entity consisting of an ordered collection of numbers. The most useful R command for quickly entering small data sets is the c() function. This function can take an arbitrary number of vector arguments, and its value is a vector obtained by concatenating its arguments end to end. For example, to enter the number of A+ grades per semester you type

    > Agrades=c(1,0,2,3,1,0,2,1)
    > Agrades
    [1] 1 0 2 3 1 0 2 1
    > Agrades[4]
    [1] 3

Here we have assigned the values to a vector Agrades. The value of the vector Agrades does not print automatically; it is printed when we type just its name. The printed value is prefaced with a [1], which indicates that the first value shown on that line is element number 1 of the vector. The data is stored in R as a vector, which simply means that R keeps track of the order in which the data was entered. In particular there is a first element, a second element, and so on up to a last element. To extract the fourth element of the vector Agrades we used the command Agrades[4].

3.1 Vector Manipulation

Let us look at an example of how to multiply vectors together. By default, multiplication in R is done elementwise and not vectorwise (as, for example, in MATLAB).

    > Agrades1=c(1,0,2,3,1,0,2,1)
    > Agrades2=c(2,1,2,3,0,3,2,2)
    > Agrades1*Agrades2
    [1] 2 0 4 9 0 0 4 2
    > t(Agrades1)%*%Agrades2
         [,1]
    [1,]   21

Hence * denotes elementwise (element by element) multiplication, whereas %*% performs vector (matrix) multiplication. The function t() takes the transpose of the vector or matrix passed to it. The symbols [1,] and [,1] denote the first row and the first column of the resulting matrix, respectively.

The elementary arithmetic operators are the usual +, -, *, / and ^ for raising to a power. In addition, all of the common arithmetic functions are available: log, exp, sin, cos, tan and sqrt for logarithm, exponential, sine, cosine, tangent and square root, respectively.

Another way of entering elements into a vector is a sequence statement of the form

    > n=length(Agrades1) # length of the vector Agrades1
    > Semesters=1:n
    > Semesters
    [1] 1 2 3 4 5 6 7 8

Here the expression 1:n was used to assign all the integer values between 1 and n to the vector Semesters. The expression a:b is simply a, a+1, a+2, ..., b if a and b are integers, and is defined in the intuitive way if not. Note that the size of a vector can be obtained through the function length(), and that the sign # starts a comment.

The function seq() is a more general facility for generating sequences. It has five arguments, only some of which may be specified in any one call. The first two arguments, if given, specify the beginning and end of the sequence. The step between consecutive elements is given by the argument by, and the desired length of the sequence by the argument length:

    > x=seq(from=1, to=4, by=0.2)
    > x
    [1] 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4.0
    > x=seq(from=1, to=4, length=8)
    > x
    [1] 1.000000 1.428571 1.857143 2.285714 2.714286 3.142857 3.571429 4.000000
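The difference between elementwise and vector multiplication can be checked directly. The sketch below reuses the grade counts from the text under the shorter, purely illustrative names a and b, and also shows that arithmetic functions and seq() operate on whole vectors.

```r
a <- c(1, 0, 2, 3, 1, 0, 2, 1)
b <- c(2, 1, 2, 3, 0, 3, 2, 2)

elementwise <- a * b        # one product per position
inner <- sum(a * b)         # a single number: the inner product
inner_mat <- t(a) %*% b     # the same value, returned as a 1 by 1 matrix

print(elementwise)
print(inner)
print(as.numeric(inner_mat) == inner)

# Arithmetic functions are also applied element by element.
print(sqrt(c(1, 4, 9)))

# A single number is recycled to match the length of the vector.
print(a + 10)

# seq() with a fixed step, and with a requested length.
by_step <- seq(from = 0, to = 1, by = 0.25)
by_length <- seq(from = 0, to = 1, length.out = 5)
print(by_step)
print(by_length)
```

Here length.out is the full name of the argument that the text abbreviates as length.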

As well as numerical vectors, R allows manipulation of logical quantities. The elements of a logical vector can have the values TRUE, FALSE, and NA (for not available). Logical vectors are generated by conditions. The logical operators are <, <=, >, >=, == for exact equality and != for inequality. In addition, if c1 and c2 are logical expressions, then c1 & c2 is their intersection (and), c1 | c2 is their union (or), and !c1 is the negation of c1.

    > Agrades==1
    [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE

where we have used the command Agrades==1 to see which elements of this vector are equal to 1. Subsets of the elements of a vector may be selected by appending to the name of the vector an index vector in square brackets. Extracting elements of a vector by means of another vector of the same size made up of TRUEs and FALSEs is referred to as extraction by a logical vector. The command Semesters[Agrades1==2] returns the semesters in which exactly two A+ grades were given.

    > Semesters[Agrades1==2] # logical extraction
    [1] 3 7
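Conditions can be combined with & and | before being used as an index. The following sketch reuses the grade counts from the text; the particular conditions and the variable names for them are made up for illustration.

```r
Agrades <- c(1, 0, 2, 3, 1, 0, 2, 1)
Semesters <- 1:length(Agrades)

at_least_two <- Agrades >= 2                       # a logical vector
none_or_many <- Agrades == 0 | Agrades >= 3        # union of two conditions
exactly_one_late <- Agrades == 1 & Semesters > 4   # intersection of two conditions

print(Semesters[at_least_two])
print(Semesters[none_or_many])
print(Semesters[exactly_one_late])

# TRUE counts as 1 and FALSE as 0, so sum() counts the matching elements.
count_two_plus <- sum(at_least_two)
print(count_two_plus)
```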

Here is the summary of data extraction commands for vectors:

    how many elements?                     length(x)
    i-th element                           x[i]
    all but i-th element                   x[-i]
    first k elements                       x[1:k]
    last k elements                        x[(length(x)-k+1):length(x)]
    specific elements                      x[c(1,3,5)]  (first, 3rd and 5th)
    all greater than some value            x[x>3]  (the value is 3)
    bigger than or less than some values   x[x< -2 | x > 2]
    which indices are largest              which(x == max(x))
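The extraction commands in the summary can be tried on a small vector. In the sketch below, the values of x and k are made up for illustration. The last k elements are taken at indices length(x)-k+1 through length(x), which yields exactly k values; tail(x, k) is a built-in shortcut for the same thing.

```r
x <- c(5, -3, 8, 1, 8, -4)
k <- 2

n_elements <- length(x)
third <- x[3]
all_but_third <- x[-3]
first_k <- x[1:k]
last_k <- x[(length(x) - k + 1):length(x)]
last_k_tail <- tail(x, k)            # built-in shortcut for the last k elements
picked <- x[c(1, 3, 5)]
above_three <- x[x > 3]
outside_band <- x[x < -2 | x > 2]
where_max <- which(x == max(x))      # every index at which the maximum occurs

print(last_k)
print(outside_band)
print(where_max)
```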

3.2 Matrix Manipulation

Vectors are the most important type of object in R, but there are a few others that we will meet. Matrices are one of them. Matrices, or more generally arrays, are multi-dimensional generalizations of vectors. In fact, they are vectors that can be indexed by two or more indices and will be printed in special ways. The function matrix() is used to assign values to small matrices or to declare an object to be a matrix. The function array() can be used to sequentially assign values to matrices.

    > M=matrix(3,2,5) # 2 by 5 matrix with all elements equal to 3
    > M
         [,1] [,2] [,3] [,4] [,5]
    [1,]    3    3    3    3    3
    [2,]    3    3    3    3    3
    > N=array(1:20, dim=c(4,5)) # generate a 4 by 5 array
    > N
         [,1] [,2] [,3] [,4] [,5]
    [1,]    1    5    9   13   17
    [2,]    2    6   10   14   18
    [3,]    3    7   11   15   19
    [4,]    4    8   12   16   20

Extraction of elements of a matrix is done similarly to that of a vector. A matrix may also be indexed by an index matrix, in which each row gives the row and column indices of one element. Negative indices are not allowed in index matrices. NA and zero values are allowed: rows in the index matrix containing a zero are ignored, and rows containing an NA produce an NA in the result.

    > N[2,3] # extracting an element of a matrix
    [1] 10
    > i <- array(c(1:3,3:1), dim=c(3,2))
    > i # i is a 3 by 2 index array
         [,1] [,2]
    [1,]    1    3
    [2,]    2    2
    [3,]    3    1
    > N[i] # extract those elements
    [1] 9 6 3
    > N[i] <- 0 # replace those elements by zeros
    > N
         [,1] [,2] [,3] [,4] [,5]
    [1,]    1    5    0   13   17
    [2,]    2    0   10   14   18
    [3,]    0    7   11   15   19
    [4,]    4    8   12   16   20

As with vectors, the arithmetic operations +, -, *, / and ^ are all done element by element. Matrix multiplication is done with the operator %*%, and the transpose of a matrix is obtained with the function t().

Here is the summary of data extraction commands for matrices:

    dimension of a matrix?        dim(x)
    (i,j)-th element              x[i,j]
    first k elements of row i     x[i,1:k]
    last k elements of row i      x[i,(dim(x)[2]-k+1):dim(x)[2]]
    i-th row                      x[i,]
    j-th column                   x[,j]
    specific elements             x[array(c(1:3,3:1), dim=c(3,2))]
    all greater than some value   x[x>3]  (the result is a vector)
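The matrix operations described above can be checked on a small example. In this sketch the matrices A and B are made up for illustration.

```r
A <- matrix(1:6, nrow = 2, ncol = 3)   # filled column by column
B <- matrix(1, nrow = 2, ncol = 3)     # 2 by 3 matrix of ones

print(dim(A))        # number of rows, then number of columns
print(A[2, 3])       # a single element
print(A[1, ])        # first row, returned as a vector
print(A[, 2])        # second column, returned as a vector

elementwise <- A * B         # same shape as A and B
product <- A %*% t(B)        # (2 by 3) times (3 by 2) gives a 2 by 2 matrix
print(elementwise)
print(product)

large <- A[A > 3]            # logical extraction returns a plain vector
print(large)
```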

3.3 Lists/Data Frames

An R list is an object consisting of an ordered collection of objects known as its components. There is no particular need for the components to be of the same mode or type, and, for example, a list could consist of a numeric vector, a logical value, a matrix, a complex vector, a character array, a function, and so on.

    > Lst = list(name="Fred", wife="Mary", no.children=3, child.ages=c(4,7,9))
    > Lst
    $name
    [1] "Fred"

    $wife
    [1] "Mary"

    $no.children
    [1] 3

    $child.ages
    [1] 4 7 9

Components are always numbered and may always be referred to as such. Thus if Lst is the name of a list with four components, these may be individually referred to as Lst[[1]], Lst[[2]], Lst[[3]] and Lst[[4]]. Components of lists may also be named, and are then extracted through Lst$name, Lst$wife, Lst$no.children and Lst$child.ages.

A data frame is a list with class "data.frame", with the restriction that the components must be vectors (numeric, character, or logical), factors, numeric matrices, lists, or other data frames. Vector structures appearing as variables of the data frame must all have the same length, and matrix structures must all have the same row size. For example, assuming that statef, incomes and incomef are existing vectors or factors of equal length:

    accountants <- data.frame(home=statef, loot=incomes, shot=incomef)

The simplest way to construct a data frame from scratch is to use the read.table() function to read an entire data frame from an external file.

The $ notation, such as accountants$home, for list components is not always very convenient. A useful facility would be somehow to make the components of a list or data frame temporarily visible as variables under their component name, without the need to quote the list name explicitly each time. The attach() function takes a database such as a list or data frame as its argument. Thus suppose lentils is a data frame with three variables lentils$u, lentils$v and lentils$w. The call attach(lentils) places the data frame in the search path at position 2, and provided there are no variables u, v or w in position 1, u, v and w are available as variables from the data frame in their own right. To detach a data frame, use the function detach().

attach() is a generic function that allows not only directories and data frames to be attached to the search path, but other classes of object as well. In particular any object of mode list may be attached in the same way: attach(any.old.list), where any.old.list can be found under R -> Packages & Data -> Data Manager.
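The ways of accessing list and data frame components described above can be sketched as follows. The students data frame and its column names are hypothetical. The with() function is shown as an alternative to attach() that is not mentioned in the text: it makes the columns visible only for the duration of one expression.

```r
# The list from the text: components by position and by name.
Lst <- list(name = "Fred", wife = "Mary", no.children = 3,
            child.ages = c(4, 7, 9))
print(Lst[[4]])
print(Lst$child.ages[2])

# A hypothetical data frame built from vectors of equal length.
students <- data.frame(
  name = c("Ann", "Bob", "Cid"),
  score = c(91, 58, 85),
  stringsAsFactors = FALSE
)

print(students$score)                   # a column by name
print(students[[2]])                    # the same column by position
high <- students[students$score > 80, ] # rows selected by a logical vector
print(high)

# with() evaluates an expression with the columns visible as variables.
avg <- with(students, mean(score))
print(avg)

# attach() puts the columns on the search path until detach() is called.
attach(students)
top <- name[which.max(score)]
detach(students)
print(top)
```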

R Commands

Technically R is an expression language with a very simple syntax. It is case sensitive, as are most UNIX-based packages, so A and a are different symbols and would refer to different variables. Normally all alphanumeric symbols are allowed (and in some countries this includes accented letters) plus . and _, with the restriction that a name must start with . or a letter, and if it starts with . the second character must not be a digit.

If commands are stored in an external file, say commands.R in the working directory work, they may be executed at any time in an R session with the command source("commands.R"). On Windows, Source is also available on the File menu.

The entities that R creates and manipulates are known as objects. These may be variables, arrays of numbers, character strings, functions, or more general structures built from such components. The R command objects() (alternatively, ls()) can be used to display the names of (most of) the objects which are currently stored within R. The collection of objects currently stored is called the workspace. To remove objects the function rm is available: rm(x, y, z, ink, junk, temp, foo, bar).

The language has available a conditional construction of the form

    if (expr_1) expr_2 else expr_3

where expr_1 must evaluate to a single logical value and the result of the entire expression is then evident. There is a vectorized version of the if/else construct, the ifelse function. This has the form ifelse(condition, a, b) and returns a vector of the length of its longest argument, with elements a[i] if condition[i] is true, otherwise b[i].

There is also a for loop construction which has the form

    for (name in expr_1) expr_2

where name is the loop variable, expr_1 is a vector expression (often a sequence like 1:20), and expr_2 is often a grouped expression with its sub-expressions written in terms of the dummy name. expr_2 is repeatedly evaluated as name ranges through the values in the vector result of expr_1. For example

    for (i in 1:length(y)) {
      plot(x[i], y[i])
      z[i]=x[i]+y[i]
    }

Other looping facilities include the repeat expr statement and the while (condition) expr statement.

The R language allows the user to create objects of mode function. These are true R functions that are stored in a special internal form and may be used in further expressions and so on.
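The conditional and looping constructs described above can be tried on a small vector. In this sketch the data and all variable names are made up for illustration.

```r
x <- c(3, -1, 4, -1, 5)

# if / else works on a single logical value.
if (length(x) > 3) {
  size <- "long"
} else {
  size <- "short"
}

# ifelse() is vectorized: it returns one result per element of the condition.
signs <- ifelse(x >= 0, "non-negative", "negative")

# A for loop accumulating a running total over the elements of x.
total <- 0
for (value in x) {
  total <- total + value
}

# A while loop: find the smallest power of 2 that is at least 100.
p <- 1
while (p < 100) {
  p <- p * 2
}

print(size)
print(signs)
print(total)
print(p)
```

In practice sum(x) would replace the for loop here; the loop is written out only to show the syntax.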
It should be emphasized that most of the functions supplied as part of the R system, such as mean(), var(), postscript() and so on, are themselves written in R and thus do not differ materially from user-written functions. A function is defined by an assignment of the form

    name <- function(arg_1, arg_2, ...) expression

The expression is an R expression (usually a grouped expression) that uses the arguments, arg_i, to calculate a value. The value of the expression is the value returned by the function. A call to the function then usually takes the form name(expr_1, expr_2, ...) and may occur anywhere a function call is legitimate. Consider the following simple example:

    > twosam = function(y1, y2) {
        n1 <- length(y1); n2 <- length(y2)
        yb1 <- mean(y1); yb2 <- mean(y2)
        s1 <- var(y1); s2 <- var(y2)
        s <- ((n1-1)*s1 + (n2-1)*s2)/(n1+n2-2)
        tst <- (yb1 - yb2)/sqrt(s*(1/n1 + 1/n2))
        tst
      }

This function can later be called as tstat <- twosam(data$male, data$female). In many cases arguments can be given commonly appropriate default values, in which case they may be omitted altogether from the call when the defaults are appropriate. Note that any ordinary assignments done within the function are local and temporary and are lost after exit from the function.

Many R functions and datasets are stored in packages. Only when a package is loaded are its contents available. To see which packages are installed on your computer, issue the command library() with no arguments. To see which packages are currently loaded in your R session, use the command search() with no arguments. To load a particular package (e.g., the boot package containing functions from Davison & Hinkley (1997)), use a command like library(boot). Users connected to the Internet can use the install.packages() and update.packages() functions to install and update packages.
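Returning to the twosam function, the sketch below calls it on two small samples and compares the result with the statistic reported by the built-in t.test() with var.equal = TRUE, which uses the same pooled-variance formula. The measurements are made up for illustration, and scale_by is a hypothetical function added only to show a default argument and a local variable.

```r
# The pooled two-sample t statistic, as defined in the text.
twosam <- function(y1, y2) {
  n1 <- length(y1); n2 <- length(y2)
  yb1 <- mean(y1); yb2 <- mean(y2)
  s1 <- var(y1); s2 <- var(y2)
  s <- ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)
  (yb1 - yb2) / sqrt(s * (1 / n1 + 1 / n2))
}

# Made-up measurements for two groups.
male <- c(5.1, 4.9, 6.2, 5.8, 6.0)
female <- c(4.8, 5.2, 5.0, 4.7)

tstat <- twosam(male, female)
builtin <- unname(t.test(male, female, var.equal = TRUE)$statistic)
print(tstat)
print(builtin)

# A hypothetical function with a default value for its second argument.
scale_by <- function(v, factor = 2) {
  result <- v * factor   # 'result' is local to the function
  result
}

doubled <- scale_by(1:3)       # uses the default factor = 2
tenfold <- scale_by(1:3, 10)   # overrides the default
print(doubled)
print(tenfold)
```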

Reading Data from Files

To read an entire data frame directly, the external file will normally have a special form. The first line of the file should have a name for each variable in the data frame. Each additional line of the file has as its first item a row label, followed by the values for each variable.

         Price    Floor     Area   Rooms     Age  Cent.heat
    01   52.00    111.0      830     5       6.2      no
    02   54.75    128.0      710     5       7.5      no
    03   57.50    101.0     1000     5       4.2      no
    04   57.50    131.0      690     6       8.8      no
    05   59.75     93.0      900     5       1.9     yes
    ...

The function read.table() can then be used to read the data frame directly:

    HousePrice = read.table("houses.data")

Often you will want to omit the row labels from the file and use the default labels. In that case the first line must be declared as a header explicitly: read.table("houses.data", header=TRUE).

Around 100 datasets are supplied with R (in package datasets), and others are available in packages (including the recommended packages supplied with R). To see the list of datasets currently available use data(). To access data from a particular package, use the package argument, for example

    > data(package="rpart")                  # shows which datasets are listed in the package rpart
    > data(car.test.frame, package="rpart")  # loads the dataset "car.test.frame"
    > head(car.test.frame)
                     Price   Country Reliability Mileage  Type Weight Disp.  HP
    Eagle Summit 4    8895       USA           4      33 Small   2560    97 113
    Ford Escort 4     7402       USA           2      33 Small   2345   114  90
    Ford Festiva 4    6319     Korea           4      37 Small   1845    81  63
    Honda Civic 4     6635 Japan/USA           5      32 Small   2260    91  92
    Mazda Protege 4   6599     Japan           5      32 Small   2440   113 103
    Mercury Tracer 4  8672    Mexico           4      26 Small   2285    97  82
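The two file layouts accepted by read.table() can be tried without an existing data file by writing a few lines to a temporary file first. This is a minimal sketch: the rows are the first three from the house price example, and the temporary file paths stand in for houses.data.

```r
# Layout 1: the header line has one fewer field than the data lines,
# so read.table() treats the first column as row labels.
with_labels <- c(
  "Price Floor Area Rooms Age Cent.heat",
  "01 52.00 111.0 830 5 6.2 no",
  "02 54.75 128.0 710 5 7.5 no",
  "03 57.50 101.0 1000 5 4.2 no"
)
path1 <- tempfile(fileext = ".data")
writeLines(with_labels, path1)
houses1 <- read.table(path1)

# Layout 2: no row labels in the file, so the header must be
# declared explicitly with header = TRUE.
no_labels <- c(
  "Price Floor Area Rooms Age Cent.heat",
  "52.00 111.0 830 5 6.2 no",
  "54.75 128.0 710 5 7.5 no",
  "57.50 101.0 1000 5 4.2 no"
)
path2 <- tempfile(fileext = ".data")
writeLines(no_labels, path2)
houses2 <- read.table(path2, header = TRUE)

print(houses1)
print(houses2)
print(mean(houses2$Price))
```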

Let us see what kinds of questions we can answer from this dataset:

1. Extract the vector of all the countries where the data was collected.
2. What is the price of the car with the smallest mileage, and what is the price of the car with the largest mileage?
3. How many cars have a mileage of 22 mpg?
4. Where is the compact car with the smallest mileage from?
