
Introduction to R

In this practical we will learn about:
- What R is and its advantages over other statistics packages
- The basics of R
- Plotting graphs
- Minimization and integration
- Discrete random variables in R

1. What is R?
R is one of many computer programs designed to carry out statistical analyses. Some of the advantages of R are:
- It's free!
- It is developed by an international team of statistical computing experts.
- It is becoming the computer program of choice for statistical research.
- All standard statistical analyses are implemented.
- It is a complete programming language, so we can extend it in any way we choose.
- It allows us to produce publication-quality graphics.
- Add-on packages are available in a diverse range of specialized fields, e.g. microarray analysis, brain imaging, etc.
- There is an extensive help system and an active email help list.
- It is available on Windows, Linux, Unix, and Macintosh operating systems.

2. Books
In addition to the extensive documentation and help system included in R, two books are recommended:
- Introductory Statistics with R by Peter Dalgaard, ISBN 0-387-95475-9: a very good introduction to R that includes many biostatistical examples.
- Modern Applied Statistics with S by Bill Venables and Brian Ripley, ISBN 0-387-95457-0: a comprehensive text that details the S-PLUS and R implementation of many statistical methods using real datasets.

3. Getting started
To start an R session from Windows, simply double-click on the R icon or select R from the program list on the Start menu. This will produce an R console window. R works in a question-and-answer mode: you enter a command, press Enter, and R carries out the command and prints the results to the screen if required. For example, if we want to know the answer to 2+2 we would simply enter 2+2 and press Enter. This should produce

> 2+2
[1] 4
>

Basic Calculations
R can be thought of as an overgrown calculator, and as such we can do all of the things we can do on a standard calculator.

> 5*4
[1] 20
> exp(-2)
[1] 0.1353353
> sqrt(12)
[1] 3.464102
> 3^5
[1] 243
>

Assigning Variables
Often we will want to store the results of a command. To do so we assign the result to a variable with a name of our choice.

> a = (3*7) + 1
> a
[1] 22

We can then manipulate the variable in any way we wish, e.g.

> a^3
[1] 10648
>

Vectors
Often we need to work with vectors of numbers. We can assign vectors in several ways. Vectors of consecutive integers can be assigned using

> v1 = 11:15
> v1
[1] 11 12 13 14 15

More generally, sequences with set intervals can be assigned using the seq() function

> v2 = seq(11, 15, by=0.5)
> v2
[1] 11.0 11.5 12.0 12.5 13.0 13.5 14.0 14.5 15.0
>

Alternatively, we can assign the vector directly using

> v3 = c(11, 12, 13, 14, 15)
> v3
[1] 11 12 13 14 15

Vectors can be manipulated in various ways

> v3^2 + 6
[1] 127 150 175 202 231
> v3 > 13.5
[1] FALSE FALSE FALSE  TRUE  TRUE

Specific elements or subsets of elements of the vector can be extracted

> v3[4]
[1] 14
> v3[1:3]
[1] 11 12 13
> v3[c(1,3,5)]
[1] 11 13 15

Vectors can also contain strings

> v4 = c("tree", "apple", "pear", "ball", "sky")
> v4
[1] "tree"  "apple" "pear"  "ball"  "sky"

Matrices
Matrices can be constructed from vectors. (Everything written after # is a comment and will not be interpreted.)

> m1 = 1:12
> dim(m1) = c(3, 4) # number of rows = 3 and number of columns = 4
> m1
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

Or using the matrix() function

> m2 = matrix(1:12, nrow=3, ncol=4, byrow=TRUE)
> m2
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
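The logical vector returned by a comparison such as v3 > 13.5 can itself be used as an index, and the byrow argument decides the order in which matrix() fills its cells. The following is a minimal sketch of both ideas; the names big and m3 are introduced here purely for illustration.

```r
# Recreate the vector used above
v3 = c(11, 12, 13, 14, 15)

# A comparison gives a logical vector, which can pick out matching elements
big = v3[v3 > 13.5]
print(big)

# matrix() fills column by column unless byrow = TRUE is given
m3 = matrix(1:12, nrow = 3, ncol = 4)
print(m3[2, ])    # second row of the column-filled matrix
print(t(m3))      # t() transposes the 3 x 4 matrix into a 4 x 3 matrix
```

Leaving an index blank, as in m3[2, ], selects everything along that dimension.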

Like vectors, matrices can be manipulated in various ways

> m2^2 + 3
     [,1] [,2] [,3] [,4]
[1,]    4    7   12   19
[2,]   28   39   52   67
[3,]   84  103  124  147
> m2[2,3]
[1] 7

The function solve() can be used to solve linear systems of equations of the form AX = B, where A and B are known

> A = matrix(1:4, 2, 2)
> B = c(4, 5)
> solve(A, B)
[1] -0.5  1.5

With just one argument, solve() inverts a matrix

> solve(A)
     [,1] [,2]
[1,]   -2  1.5
[2,]    1 -0.5

Lists
Often we would like to store data of several different types and sizes in one object. This can be achieved using a list.

> l1 = list(v1=v1, v4=v4, m1=m1)
> l1
$v1
[1] 11 12 13 14 15

$v4
[1] "tree"  "apple" "pear"  "ball"  "sky"

$m1
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

Components of a list can be accessed by name using the $ symbol

> l1$v1
[1] 11 12 13 14 15

Or by position

> l1[1]
$v1
[1] 11 12 13 14 15

Note that single brackets return a list containing the component; l1[[1]] returns the vector itself.

ls()
The variables, vectors, matrices and lists we have assigned so far will be stored until the end of the R session. We can see which objects we have created using the function ls()

> ls()
 [1] "a"  "A"  "B"  "l1" "m1" "m2" "v1" "v2" "v3" "v4"

Functions
So far we have used the functions exp, sqrt, seq, c, dim, matrix, solve and ls. The statistical functionality of R is accessed through functions, so we need to be familiar with their use. The format of a function call is the function name followed by a set of parentheses containing one or more arguments. The ? symbol can be used to get a description of any given function. For example,

> ?matrix

will produce a description of matrix() in a separate window. The most precise way to specify the arguments to a function is by name. For example,

> matrix(data=1:4, nrow=2, ncol=2, byrow=FALSE)
     [,1] [,2]
[1,]    1    3
[2,]    2    4

Alternatively, arguments can be specified by position if we know the form of the function.

> matrix(1:4, 2, 2)
     [,1] [,2]
[1,]    1    3
[2,]    2    4

Many function arguments have sensible default settings and thus can be omitted in standard function calls. For example, the second call of matrix() above did not include the byrow argument, as the default is FALSE. Often we will want to do something R cannot do directly, so we can write our own functions. For example, we might want to evaluate the quadratic x^2 - 2x + 4 many times, so we can write a function that evaluates it for a specific value of x.

> my.f = function(x) {x^2 - 2*x + 4}
> my.f(3)
[1] 7
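Because arithmetic in R works element by element, a function such as my.f can be applied to a whole vector at once, and our own functions can have default arguments just like matrix(). The sketch below illustrates both points; my.quad and its arguments a, b and const are hypothetical names that do not appear elsewhere in this practical.

```r
# The quadratic from the text
my.f = function(x) {x^2 - 2*x + 4}

# Arithmetic is element-wise, so my.f accepts a whole vector
vals = my.f(c(0, 1, 2, 3))
print(vals)

# A hypothetical generalisation with default coefficients;
# called without them it reproduces my.f
my.quad = function(x, a = 1, b = -2, const = 4) {a*x^2 + b*x + const}
print(my.quad(3))
print(my.quad(3, const = 0))   # override one default by name
```

Naming the overridden argument, as in const = 0, leaves the other defaults untouched.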

The R help system
In addition to the ? command, R has extensive documentation and a user-friendly help system. These can be accessed by

> help.start()

which should open a separate window. From this window you will be able to browse the manuals and search for functions.
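Two further base R functions can help when only part of a function's name or argument list is remembered. This is a small sketch rather than a complete tour of the help system; the variable name found is an assumption made for the example.

```r
# args() shows the argument list of a function, including defaults
print(args(matrix))

# apropos() lists objects whose names match a pattern;
# here, everything whose name starts with "read."
found = apropos("^read\\.")
print(found)
```

The functions help.search() and example() are also available for searching the documentation by keyword and for running the examples attached to a help page.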

4. Reading in your own data


R has a very useful function called read.table for reading data into a session. Use a text editor (like Notepad or Tinn-R) to create a file called data.txt which should look something like this

1 34.6 87.2
2 65.3 76.2
3 76.6 71.9
4 42.0 9.01
5 45.3 87.1

Read the data into R using the command

> a = read.table(file="path-to-file")
> a
  V1   V2    V3
1  1 34.6 87.20
2  2 65.3 76.20
3  3 76.6 71.90
4  4 42.0  9.01
5  5 45.3 87.10

You can specify row and column names for the dataset using

> colnames(a) = c("X", "Y", "Z")
> rownames(a) = c("a", "b", "c", "d", "e")
> a
  X    Y     Z
a 1 34.6 87.20
b 2 65.3 76.20
c 3 76.6 71.90
d 4 42.0  9.01
e 5 45.3 87.10
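The steps above can be tried without creating data.txt by hand. The sketch below writes the same five rows to a temporary file (a stand-in for data.txt), reads them back, and names the columns while reading through the col.names argument of read.table. The variable names tmp and d are assumptions made for this example.

```r
# Write the five example rows to a temporary file (stand-in for data.txt)
tmp = tempfile(fileext = ".txt")
writeLines(c("1 34.6 87.2",
             "2 65.3 76.2",
             "3 76.6 71.9",
             "4 42.0 9.01",
             "5 45.3 87.1"), tmp)

# col.names sets the column names while the file is read
d = read.table(file = tmp, col.names = c("X", "Y", "Z"))
rownames(d) = c("a", "b", "c", "d", "e")

print(d)
str(d)                # shows the type of each column
print(d["d", "Z"])    # index by row name and column name
print(mean(d$Y))      # columns can be extracted with $

unlink(tmp)           # remove the temporary file
```

The result of read.table is a data frame, so its columns can be used like the list components described earlier.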

5. Graphics
One of the most important parts of any statistical analysis is the graphical exploration and presentation of the data and results. R has excellent graphics functionality that allows us to produce publication-quality plots. First we need some data to plot! There are many useful biostatistical datasets contained in the package ISwR, which accompanies the book Introductory Statistics with R by Peter Dalgaard. This package can be loaded using

> library(ISwR)

If the package is not installed, it should be downloaded from the R website and saved. Then click on Packages > Install package(s) from local zip file. Once the package is loaded you can import the data by typing

> data(cystfibr)

The function plot() can be used to plot two variables from the dataset against each other

> plot(cystfibr$weight, cystfibr$tlc)

Figure 1: A plot of weight versus total lung capacity for the cystfibr dataset.

The R function curve() can be used to plot a given function or expression over a given range.

> curve(my.f(x), from = -10, to = 10)

If we have another function that we want to plot, we can add it to the existing plot using the argument add = TRUE
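As a minimal sketch of overlaying a second curve with add = TRUE, the example below plots my.f together with a second function. The function my.g is hypothetical and chosen only so that the two curves are easy to tell apart; the line types and legend are likewise illustrative choices.

```r
# The quadratic from the text
my.f = function(x) {x^2 - 2*x + 4}

# A second, hypothetical function to overlay
my.g = function(x) {0.5*x^2 + 10}

# Draw the first curve; curve() invisibly returns the points it plotted
pts = curve(my.f(x), from = -10, to = 10, ylab = "y")

# add = TRUE draws on the existing plot instead of starting a new one
curve(my.g(x), from = -10, to = 10, add = TRUE, lty = 2)

legend("top", legend = c("my.f", "my.g"), lty = c(1, 2))
```

The axis limits are fixed by the first call, so the curve with the larger range is best drawn first.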

Figure 2: Plotting mathematical functions using curve().

The functions hist() and boxplot() produce histograms and boxplots respectively. Let's suppose we want to plot a histogram and a boxplot next to each other in the same graphics window. This can be done using the function par(), which can be used to set many graphical parameters.

> par(mfrow = c(1,2))

divides the graphics window into a 1 x 2 grid. Subsequent figures will be drawn in the grid by row.

Figure 3: A histogram and boxplot in the same window.

To plot by column use mfcol
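The layout described above might be produced as follows. This is a sketch under the assumption that simulated data are acceptable in place of a real dataset: the vector x is generated with rnorm() so that the example runs without the ISwR package, and the seed, mean and standard deviation are arbitrary.

```r
set.seed(1)                          # arbitrary seed, for reproducibility
x = rnorm(100, mean = 50, sd = 10)   # simulated stand-in data

old = par(mfrow = c(1, 2))   # 1 x 2 grid; par() returns the previous settings
h = hist(x, main = "Histogram")
boxplot(x, main = "Boxplot")
par(old)                     # restore the previous layout
```

Saving the value returned by par() and passing it back afterwards keeps later plots from being drawn into the two-panel grid.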

6. Minimization and integration


R has two functions, optimize() and integrate(), that can be used to numerically minimize and integrate mathematical functions over a given range. For example, consider our quadratic x^2 - 2x + 4 that we have programmed into our function my.f. We can find the minimum of the function using

> optimize(my.f, lower=-10, upper=10)
$minimum
[1] 1

$objective
[1] 3

which says that the minimum occurs at x = 1 and at that point the quadratic has value 3. We can integrate the function over the interval [-10, 10] using

> integrate(my.f, lower = -10, upper = 10)
746.6667 with absolute error < 8.3e-12

which gives an answer together with an estimate of the absolute error (in this case very small).
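Both results can be checked by hand, since x^2 - 2x + 4 = (x - 1)^2 + 3 has its minimum of 3 at x = 1 and its antiderivative is x^3/3 - x^2 + 4x. The sketch below compares the numerical answers with these exact values; the names opt, int, F.exact and exact are introduced only for this example.

```r
my.f = function(x) {x^2 - 2*x + 4}

# Numerical minimum: a list with components $minimum and $objective
opt = optimize(my.f, lower = -10, upper = 10)
print(opt$minimum)

# Numerical integral: the estimate is stored in $value
int = integrate(my.f, lower = -10, upper = 10)

# Exact integral from the antiderivative F(x) = x^3/3 - x^2 + 4x
F.exact = function(x) {x^3/3 - x^2 + 4*x}
exact = F.exact(10) - F.exact(-10)

print(c(numerical = int$value, exact = exact))
```

Storing the results in variables gives access to the individual components, which is convenient when the minimum or the integral is needed in a later calculation.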

7. Discrete Random Variables


R has functions that can simulate pseudo-random variables from nearly all the standard discrete distributions. For example, the functions rbinom() and rpois() generate binomial and Poisson random variables respectively.

> rbinom(n = 10, size = 3, prob = 0.5)
 [1] 0 2 1 1 3 3 3 1 2 1
> rpois(n = 10, lambda = 2)
 [1] 3 2 0 2 2 1 3 3 1 1

We can use these functions to approximate specific probabilities by simulation. For example, if X ~ Bin(3, 0.5) then P(X = 2) can be approximated by

> x1 = rbinom(10000, 3, 0.5)
> px = sum(x1 == 2) / 10000
> px
[1] 0.3747

Similarly, if Y ~ Po(2) then P(Y = 3) can be approximated by

> y1 = rpois(10000, 2)
> py = sum(y1 == 3) / 10000
> py
[1] 0.1782

The functions dbinom() and dpois() calculate the probability of a specific discrete value from both distributions exactly. For example, if X ~ Bin(3, 0.5) then P(X = 2) is given by

> dbinom(2, 3, 0.5)
[1] 0.375

Similarly, if Y ~ Po(2) then P(Y = 3) is given by

> dpois(3, 2)
[1] 0.1804470
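The comparison between simulated and exact probabilities can be carried out for every possible value of X at once. The following sketch does this for X ~ Bin(3, 0.5); the seed is arbitrary and is set only so that repeated runs give the same simulated values, and the names n, sim and exact are assumptions made for the example.

```r
set.seed(42)   # arbitrary seed so the simulation is reproducible

n = 10000
x1 = rbinom(n, size = 3, prob = 0.5)

# Simulated relative frequency of each value 0, 1, 2, 3
sim = sapply(0:3, function(k) mean(x1 == k))

# Exact probabilities from dbinom()
exact = dbinom(0:3, size = 3, prob = 0.5)

print(rbind(value = 0:3, simulated = sim, exact = exact))
```

The simulated frequencies should move closer to the exact probabilities as n is increased.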
