
ST 540: An Introduction to R

Ryan T. Elmore and Jennifer A. Hoeting August 22, 2007

General R Info
We will use the R statistical software in STAT 540. R is a free, open-source implementation of the S language (the language behind S-Plus) and is platform-independent: it runs on Windows, Mac, and Unix/Linux. You can download R for free from the R Project website at www.r-project.org. In other words, you should have no trouble finding a copy of R to use, provided that you can find a computer. R is installed on most, if not all, community computers in the Statistics Building. There is quite a bit of free documentation available at the R website, or, if you want to spend money, you can find books on R too.

Once you have installed R, it is a good idea to create separate working directories for your various courses. For example, you might create a directory called STAT540, or something along those lines. It also helps to create subdirectories within this directory so that you don't mask objects or create other problems. If you go this route, you will have to change your working directory to your STAT540 directory in order for the objects to be saved there. You can change your working directory using the menus or with the setwd() command.
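As a rough sketch of the setwd() workflow described above, the following creates a course directory and switches into it. The location (a folder inside R's temporary directory) and the variable names are assumptions made only so the example can run anywhere; in practice you would point to a folder of your own.

```r
# Hypothetical course folder inside R's temporary directory
course_dir <- file.path(tempdir(), "STAT540")
dir.create(course_dir, showWarnings = FALSE)

old_dir <- getwd()    # remember where we started
setwd(course_dir)     # make the course folder the working directory
basename(getwd())     # last component of the current working directory
setwd(old_dir)        # switch back when finished
```

getwd() reports the current working directory, so it is a quick way to check where R will save and look for files.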

Saving your work


You'll need to save your code for two reasons. First, we'll reuse a lot of the commands in this class, so you'll want to refer to your previous assignments. Second, you'll need to turn in a clean version of your final code with every assignment that requires computing. There are two options for saving your work. As described below, you might end up using a combination of both options, depending on your goals for a particular R session.

1. Use Microsoft Word (or your favorite text editor, such as Notepad, vi, or emacs).
   a. Open a blank Word document by clicking on the Word icon.
   b. Write the necessary commands in R to carry out the analyses and then copy and paste the output into your Word document.
   c. Example of output copied and pasted into a Word document:

      Part a. Create a vector N of normal random variables of length 20

      > N = rnorm(20)
      > N
       [1]  0.91499114 -0.34835346 -1.03180584
       [4]  0.92598610 -0.02966679  1.47782338
       [7] -0.17697113  0.64280212 -0.70764661
      [10] -0.60311414  0.54913008 -2.24526009
      [13]  0.75785810  1.78372173 -0.30530967
      [16] -1.19859043 -0.73382970  1.11611613
      [19]  0.36869011  0.54810743

2. R scripts
   a. Within R, use the menus: File -> New script
   b. Type your commands in the script window and run the commands like I showed you in class.
   c. Copy and paste your output from the command window into your script window.

3. Pros and cons: The advantage of Word is that you can copy and paste plots into Word and can make your output pretty. A lot of students write their entire assignment in Word, which makes it look nice (which never hurts your grade!). The advantage of using R scripts is that you can execute the code in R, so you can do things like run an entire script at once. I use a combination of R scripts and either Word or LaTeX documents. I use the R scripts when doing the computing; then I copy the final version and any plots into Word. One trick I learned recently: if you copy output from the command line window into your R script and then copy it into Word, the output format is often better. This extra step is a pain, but it can be useful if you are copying a large table or other formatted output from R into Word.

4. Saving files in the Weber lab: If you aren't done with the assignment before the class is done, save it to a memory stick or some other device. Forgot your memory stick? One option is to email the document to yourself. Saving any files to the computers in the Weber lab is a bad idea because they usually won't be there when you come back (and you may not be able to get the machine that you are currently working on anyway).
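Returning to the R script option: besides running commands from the script window, a saved script can be run in one step with the source() function. The sketch below is only an illustration; the file name hw1.R and its two commands are made up, and the script is written to R's temporary directory so the example is self-contained.

```r
# Write a tiny two-line script to a temporary file (hypothetical name)
script_file <- file.path(tempdir(), "hw1.R")
writeLines(c("N <- rnorm(20)", "N.mean <- mean(N)"), script_file)

# Run every command in the script; echo = TRUE prints each command as it runs
source(script_file, echo = TRUE)

length(N)   # the objects created by the script are now in the workspace
```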

The Basics
From this point forward, it is assumed that you know how to open R and can see the prompt. The first thing you should know is the assignment operator. To assign a value (scalar, vector, matrix, Boolean, numeric, character, etc.) to an object, use the arrow <- or the equal sign =. For example, to assign the value 5 to the variable x, type either

> x <- 5

or

> x = 5

In either case, if you want to see what you have stored in x, simply type x at the prompt and you will see

> x
[1] 5

The object x will now be in your workspace for the remainder of your R session, or longer if you save your workspace at the end of your session. When you quit R, it will ask whether you want to save the workspace image. If you answer yes, it will save all the objects that you created in this session to a file in your current working directory. If you wish to delete the object x from your workspace, simply type rm(x).
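The assign, inspect, and delete cycle described above can be tried in a few lines. This is a minimal sketch; the call to exists() is an addition used here only to confirm that the object is really gone.

```r
x <- 5                          # assign with the arrow
x = 5                           # same effect with the equal sign
x                               # typing the name prints the stored value
rm(x)                           # delete the object
exists("x", inherits = FALSE)   # check whether x is still defined here
```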

Creating Vectors
There are several ways to create vectors (or matrices, or arrays) in R; however, the most basic is to use the concatenate function c(). For example,

> y <- c(6,7,12,99)
> y
[1]  6  7 12 99

If the vector that you are trying to create is patterned in some sense, it might make more sense to use the rep() or seq() commands. Some uses of these functions include:

> rep(4,len=5)
[1] 4 4 4 4 4
> rep(y,each=2)
[1]  6  6  7  7 12 12 99 99
> seq(-3,3,len=10)
 [1] -3.0000000 -2.3333333 -1.6666667 -1.0000000 -0.3333333  0.3333333
 [7]  1.0000000  1.6666667  2.3333333  3.0000000
> seq(-3,3,by=.5)
 [1] -3.0 -2.5 -2.0 -1.5 -1.0 -0.5  0.0  0.5  1.0  1.5  2.0  2.5  3.0

Note also that

> z <- 1:10

will create a vector from 1 to 10 in increments of 1. You can access subsets of a vector by placing the indices of the subset in brackets following the vector. For example, if we want the first and fourth elements of y, issue the following command:

> y[c(1,4)]
[1]  6 99
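A few more variations on the same functions may help. The sketch below reuses the vector y from above; negative and logical indices are not covered in the text but follow the same bracket notation.

```r
y <- c(6, 7, 12, 99)

rep(y, times = 2)      # 6 7 12 99 6 7 12 99  (whole vector repeated)
seq(0, 1, by = 0.25)   # 0.00 0.25 0.50 0.75 1.00

y[2:3]                 # 7 12      (second and third elements)
y[-1]                  # 7 12 99   (everything except the first element)
y[y > 10]              # 12 99     (elements satisfying a condition)
```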

Warning: Do not name objects c or t! These are the names of the concatenate and transpose functions, respectively, and reusing them invites confusion and can mask the built-in functions. If you accidentally create an object with one of these names, just remove it using the rm() function. To see a listing of the objects in your current workspace, type ls().
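The recovery steps in the warning can be sketched as follows; the object named c is created on purpose here, purely for illustration.

```r
c <- 5                 # oops: an object named c now sits in the workspace
"c" %in% ls()          # ls() lists workspace objects, so we can look for it
rm(c)                  # remove our object; the built-in c() is untouched
y <- c(6, 7, 12, 99)   # concatenate works as usual
y
```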

Importing Data
The authors of our textbook provide the data for each problem on the CD which accompanies the book. The data are given in text files, for example filename.txt, under one of the subdirectories on the disc.

Before we proceed, you need to learn about data frames. According to the R manual, data frames are "collections of variables which share many of the properties of matrices and of lists, used as the fundamental data structure by most of R's modeling software." You'll learn more about data frames as you use them in this class. You'll find it useful to read the section on lists and data frames in the manual, which you can open using the menu items Help -> Manuals -> An Introduction to R.

Two of the several useful commands for reading in data are read.table and scan. The read.table command creates a data frame; scan creates a vector. The scan() function reads the data into R in a line-by-line fashion. One of the best uses of the scan() function (along with the matrix() function) is to read the data into a matrix that is then converted to a data frame. This is done in the following way. Note that the path in the first line below uses Windows-style backslashes, and each backslash must be doubled inside an R string. R also accepts the more familiar forward slash / on all platforms, including Windows.

> my.mat <- matrix(scan("..\\KNN_data\\chap1\\CH01PR19.txt"),ncol=2,byrow=TRUE)
> my.dat <- data.frame(my.mat)
> names(my.dat) <- c("my.y","my.x")

You can then call the variables by typing my.dat$my.y, for example. Alternatively, you can attach the data frame and access its variables by name. Using distinctive names such as my.y and my.x should prevent you from masking variables in the future, e.g. by reusing a name like y that has already been created.

> attach(my.dat)
> my.y[1:10]
 [1] 3.897 3.885 3.778 2.540 3.028 3.865 2.962 3.961 0.500 3.178

You should detach the data frame once you have finished using the particular data set. Detaching will prevent you from accessing the variables directly; however, the data frame will still exist.

> detach(my.dat)
> my.y[1:10]
Error: object "my.y" not found
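Since read.table is mentioned above but not demonstrated, here is a minimal sketch of it. The file is a temporary one written on the spot, and its four rows of numbers are made up for illustration; they are not the textbook data. The name my.dat2 is likewise hypothetical.

```r
# Write a small two-column text file to a temporary location.
# These values are invented for the example, not taken from the textbook CD.
data_file <- tempfile(fileext = ".txt")
writeLines(c("3.9 21", "3.1 14", "2.5 28", "3.6 22"), data_file)

# read.table() returns a data frame directly; col.names labels the columns
my.dat2 <- read.table(data_file, header = FALSE,
                      col.names = c("my.y", "my.x"))

str(my.dat2)         # structure: two columns, four rows
my.dat2$my.y         # access a column without attach()
mean(my.dat2$my.x)   # 21.25
```

Compared with the scan() and matrix() route, read.table() skips the intermediate matrix and can name the columns in the same call.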

Plotting
The graphics and plotting capabilities of R are outstanding. Many pharmaceutical, biotech, insurance, and other companies now use S-Plus or R for their graphics (in addition to SAS). We will generally use the plot() function to generate a 2-D plot. This function is highly customizable when using the functions in the par family of graphical parameters. We will illustrate these functions using the data in my.dat.

> attach(my.dat)
> plot(my.x,my.y,xlab="My X Variable",ylab="My Y Variable",
+ main="A Simple Plot")

Notice that the figure appears in a separate window. This is called a graphics device. You can save the figure as a pdf by clicking on File -> Save As when the graphics window is active. Close the window by typing dev.off().

The graphics parameters in par allow the user to change the shape, size, and color of the plotting character, the design of the axes, the box, and a host of other parameters. See ?par for more options.

> plot(my.x,my.y,xlab="My X Variable",ylab="My Y Variable",
+ main="A Simple Plot",pch=8,cex=1.5,col="red")

You can add more points or lines to an existing figure by using the points(), lines(), or abline() functions.

> points(c(20,30),c(1,1),pch=16,col="blue")
> lines(c(20,30),c(1,1),lty=2,col="blue")
> abline(2,.05,lty=3,col="brown")

Notice that using the plot() function will create a new figure (and eliminate the previous figure), whereas these three functions simply add to an existing plot. In order to open a new graphics window you will need to type windows() or win.graph() on a Windows machine, quartz() on a Mac, or X11() on any platform. This will open an empty graphics window, and subsequent figures are plotted there. You can redirect the plotting to the previous window (and back) using the dev.set() command, e.g. dev.set(2) will change it back to the original window.
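As an alternative to the File -> Save As menu, a figure can be written to a file from code by opening a pdf() device before plotting. The sketch below uses simulated data (demo.x and demo.y are made-up names and values, not my.dat) and writes to R's temporary directory so that it runs on its own.

```r
# Made-up data for illustration only
set.seed(1)
demo.x <- 1:20
demo.y <- 2 + 0.05 * demo.x + rnorm(20, sd = 0.5)

plot_file <- file.path(tempdir(), "simple_plot.pdf")
pdf(plot_file)                            # open a PDF graphics device
plot(demo.x, demo.y, xlab = "My X Variable", ylab = "My Y Variable",
     main = "A Simple Plot", pch = 8, col = "red")
abline(2, 0.05, lty = 3, col = "brown")   # line with intercept 2, slope 0.05
dev.off()                                 # close the device, writing the file

file.exists(plot_file)
```

Because the commands live in a script, the same figure can be regenerated exactly whenever the data or labels change.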

Simple Linear Regression


Recall from class that the method of Least Squares is used to get the parameter estimates of the intercept and slope in the simple linear regression model. We will use the lm() function to create our linear model object in R and then several other functions to examine this object. Let us consider a regression of Y on X from the my.dat data frame.

> my.lm <- lm(my.y ~ my.x)

This tells R to perform a linear regression of my.y as a simple linear function of my.x. We can examine my.lm using the names(), summary(), and anova() functions. Here are the first two commands and their output.

> names(my.lm)
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"
> summary(my.lm)

Call:
lm(formula = my.y ~ my.x)

Residuals:
     Min       1Q   Median       3Q      Max
-2.74004 -0.33827  0.04062  0.44064  1.22737

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.11405    0.32089   6.588 1.30e-09 ***
my.x         0.03883    0.01277   3.040  0.00292 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6231 on 118 degrees of freedom
Multiple R-Squared: 0.07262,     Adjusted R-squared: 0.06476
F-statistic:  9.24 on 1 and 118 DF,  p-value: 0.002917

The fitted LS regression line is found under the Coefficients: part of the output. For this particular problem, the fitted line is $\hat{Y} = 2.11405 + 0.03883X$. We can add the LS line to the plot of my.x versus my.y using the abline() function.

> abline(my.lm)

Explore other output by examining the other objects in my.lm. For example, type

> my.lm$residuals
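To explore a fitted model without the textbook data at hand, the following sketch fits lm() to simulated data. Everything named sim.* is hypothetical, and the true intercept and slope used to generate the data are arbitrary choices. It also passes the data frame through the data argument of lm(), which avoids the need for attach().

```r
# Simulated data; the numbers are arbitrary and not from the textbook
set.seed(540)
sim.dat <- data.frame(sim.x = runif(50, 10, 35))
sim.dat$sim.y <- 2 + 0.04 * sim.dat$sim.x + rnorm(50, sd = 0.6)

sim.lm <- lm(sim.y ~ sim.x, data = sim.dat)   # data= avoids attach()

coef(sim.lm)              # estimated intercept and slope
head(residuals(sim.lm))   # same values as sim.lm$residuals
anova(sim.lm)             # analysis of variance table

# Predicted mean response at a new X value
predict(sim.lm, newdata = data.frame(sim.x = 25))
```

The extractor functions coef() and residuals() return the same components that appear in names(sim.lm), so either style can be used.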

Help
To get help on a particular function, type ?functionname or help(functionname), replacing functionname with the name of the function; for example, ?lm or help(lm). If you don't know the name of a function, you can search for candidate functions by keyword using help.search("xxx"). You can also use the menu-based help.
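Two related tools, not mentioned above, can help when you remember only part of a function's name or just want its argument list: apropos() and args(). A small sketch:

```r
# List visible objects whose names start with "read."
read_funs <- apropos("^read\\.")
read_funs

# Show the argument list of a function without opening its help page
args(seq)
```

apropos() matches names against a regular expression, so the pattern above is anchored with ^ and the dot is escaped.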

Reference
The Basics of S and S-Plus by Andreas Krause and Melvin Olson.
