
Data Mining with R:

learning by case studies

Luis Torgo

LIACC-FEP, University of Porto
R. Campo Alegre, 823 - 4150 Porto, Portugal
email: ltorgo@liacc.up.pt
http://www.liacc.up.pt/ltorgo

May 22, 2003

Preface

The main goal of this book is to introduce the reader to the use of R as a tool for performing data mining. R is a freely downloadable 1 language and environment for statistical computing and graphics. Its capabilities and the large set of available packages make this tool an excellent alternative to the existing (and expensive!) data mining tools.

One of the key issues in data mining is size. A typical data mining problem involves a large database from which one seeks to extract useful knowledge. In this book we will use MySQL as the core database management system. MySQL is also freely available 2 for several computer platforms. This means that you will be able to perform “serious” data mining without having to pay any money at all. Moreover, we hope to show you that this comes with no compromise in the quality of the obtained solutions. Expensive tools do not necessarily mean better tools! R together with MySQL form a pair very hard to beat as long as you are willing to spend some time learning how to use them. We think that it is worthwhile, and we hope that you are convinced as well at the end of reading this book.

The goal of this book is not to describe all facets of data mining processes. Many books exist that cover this area. Instead we propose to introduce the reader to the power of R and data mining by means of several case studies. Obviously, these case studies do not represent all possible data mining problems that one can face in the real world. Moreover, the solutions we describe cannot be taken as complete solutions. Our goal is more to introduce the reader to the world of data mining using R through practical examples. As such, our analysis of the case studies has the goal of showing examples of knowledge extraction using R, instead of presenting complete reports of data mining case studies. They should be taken as examples of possible paths in any data mining project and can be used as the basis for developing solutions for the reader’s data mining projects. Still, we have tried to cover a diverse set of problems posing different challenges in terms of size, type of data, goals of analysis and tools that are necessary to carry out this analysis.

We do not assume any prior knowledge about R. Readers who are new to R and data mining should be able to follow the case studies. We have tried to make the different case studies self-contained in such a way that the reader can start anywhere in the document. Still, some basic R functionalities are introduced in the first, simpler, case studies, and are not repeated, which means that if you are new to R, then you should at least start with the first case studies to get acquainted with R. Moreover, the first chapter provides a very short introduction to R basics, which may facilitate the understanding of the following chapters.

We also do not assume any familiarity with data mining or statistical techniques. Brief introductions to different modeling approaches are provided as they are necessary in the case studies. It is not an objective of this book to provide the reader with full information on the technical and theoretical details of these techniques. Our descriptions of these models are given to provide basic understanding of their merits, drawbacks and analysis objectives. Other existing books should be considered if further theoretical insights are required. At the end of some sections we provide “Further readings” pointers for the readers interested in knowing more on the topics.

In summary, our target readers are more users of data analysis tools than researchers or developers. Still, we hope the latter also find reading this book useful as a form of entering the “world” of R and data mining.

The book is accompanied by a set of freely available R source files that can be obtained at the book Web site 3 . These files include all the code used in the case studies. They facilitate the “do it yourself” philosophy followed in this document. We strongly recommend that readers install R and try the code as they read the book. All data used in the case studies is available at the book Web site as well.

1 Download it from http://www.R-project.org.
2 Download it from http://www.mysql.com.

(DRAFT - May 22, 2003)

Contents

Preface

1 Introduction
  1.1 How to read this book?
  1.2 A short introduction to R
    1.2.1 Starting with R
    1.2.2 R objects
    1.2.3 Vectors
    1.2.4 Vectorization
    1.2.5 Factors
    1.2.6 Generating sequences
    1.2.7 Indexing
    1.2.8 Matrices and arrays
    1.2.9 Lists
    1.2.10 Data frames
    1.2.11 Some useful functions
    1.2.12 Creating new functions
    1.2.13 Managing your sessions
  1.3 A short introduction to MySQL

2 Predicting Algae Blooms
  2.1 Problem description and objectives
  2.2 Data Description
  2.3 Loading the data into R
  2.4 Data Visualization and Summarization
  2.5 Unknown values
    2.5.1 Removing the observations with unknown values
    2.5.2 Filling in the unknowns with the most frequent values
    2.5.3 Filling in the unknown values by exploring correlations
    2.5.4 Filling in the unknown values by exploring similarities between cases
  2.6 Obtaining prediction models
    2.6.1 Multiple linear regression
    2.6.2 Regression trees
  2.7 Model evaluation and selection
  2.8 Predictions for the 7 algae
    2.8.1 Preparing the test data
    2.8.2 Comparing the alternative models
    2.8.3 Obtaining the prediction for the test samples
  2.9 Summary

3 Predicting Stock Market Returns
  3.1 Problem description and objectives
  3.2 The available data
    3.2.1 Reading the data from the CSV file
    3.2.2 Reading the data from a MySQL database
    3.2.3 Getting the data from the Web
  3.3 Time series predictions
    3.3.1 Obtaining time series prediction models
    3.3.2 Evaluating time series models
    3.3.3 Model selection
  3.4 From predictions into trading actions
    3.4.1 Evaluating trading signals
    3.4.2 A simulated trader
  3.5 Going back to data selection
    3.5.1 Enriching the set of predictor variables

Bibliography

Chapter 1

Introduction

R is a language and an environment for statistical computing. It is similar to the S language developed at AT&T Bell Laboratories by Rick Becker, John Chambers and Allan Wilks. There are versions of R for the Unix, Windows and Mac families of operating systems. Moreover, R runs on different computer architectures like Intel, PowerPC, Alpha systems and Sparc systems. R was initially developed by Ihaka and Gentleman (1996), both from the University of Auckland, New Zealand. The current development of R is carried out by a core team of a dozen people from different institutions around the world. R development takes advantage of a growing community that cooperates in its development due to its open source philosophy. In effect, the source code of every R component is freely available for inspection and/or adaptation.

There are many critics of the open source model. Most of them mention the lack of support as the main drawback of open source software. It is certainly not the case with R! There are many excellent documents, books and sites that provide free information on R. Moreover, the excellent R-help mailing list is a source of invaluable advice and information, much better than any amount of money could ever buy! There are also searchable mailing list archives 1 that you can use before posting a question.

Data Mining has to do with the discovery of useful, valid, unexpected and understandable knowledge from data. These general objectives are obviously shared by other disciplines like statistics, machine learning or pattern recognition. One of the most important distinguishing issues in data mining is size. With the advent of computer technology and information systems, the amount of data available for exploration has increased exponentially. This poses difficult challenges to the standard data analysis disciplines: one has to consider issues like computational efficiency, limited memory resources, interfaces to databases, etc. All these issues turn data mining into a highly interdisciplinary subject involving tasks not only of typical data analysts but also of people working with databases, data visualization in high dimensions, etc.

R has limitations with handling enormous datasets because all computation is carried out in the main memory of the computer. This does not mean that we will not be able to handle these problems. Taking advantage of the highly flexible database interfaces available in R, we will be able to perform data mining on large problems. Being faithful to the Open Source philosophy we will use the excellent MySQL database management system 2 . MySQL is also available for a quite large set of computer platforms and operating systems. Moreover, R has a package that enables an easy interface to MySQL (package “RMySQL”).

In summary, we hope that at the end of reading this book you are convinced that you can do data mining on large problems without having to spend any money at all! That is only possible due to the generous and invaluable contribution of lots of scientists that build such wonderful tools as R and MySQL.

1.1 How to read this book?

The main spirit behind the book is:

Learn by doing it!

The book is organized as a set of case studies. The “solutions” to these case studies are obtained using R. All necessary steps to reach the solutions are described. Using the book Web site 3 you may get all code included in the document, as well as all data of the case studies. This should facilitate trying them out by yourself. Ideally, you should read this document beside your computer and try every step as it is presented to you in the document. R code is shown in the book using the following font,

> R.version
platform i686-pc-linux-gnu
arch     i686
os       linux-gnu
system   i686, linux-gnu
status
major    1
minor    7.0
year     2003
month    04
day      16
language R

R commands are entered at the R command prompt, “>”. Whenever you see this prompt you may interpret it as R waiting for you to enter a command. You type in the commands at the prompt and then press the enter key to ask R to execute them. This usually produces some form of output (the result of the command) and then a new prompt appears. At the prompt you may use the up arrow key to browse through previously entered commands. This is handy when you want to type commands similar to what you have done before, as you avoid typing them again. Still, you may take advantage of the code provided at the book Web site to cut and paste between your browser and the R console, thus avoiding having to type all commands described in the book. This will surely facilitate your learning, and improve your understanding of its potential.


1.2 A short introduction to R

The goal of this section is to provide a brief introduction to the key issues of the R language. We do not assume any familiarity with computer programming. Every reader should be able to follow the examples presented in this section. Still, if you feel some lack of motivation to continue reading this introductory material, do not worry. You may proceed to the case studies and then return to this introduction as you get more motivated by the concrete applications.

R is a functional language for statistical computation and graphics. It can be seen as a dialect of the S language (developed at AT&T) for which John Chambers was awarded the 1998 Association for Computing Machinery (ACM) Software award which mentioned that this language “forever altered how people analyze, visualize and manipulate data”.

R can be quite useful just by using it in an interactive fashion. Still, more advanced uses of the system will lead the user to develop his own functions to systematize repetitive tasks, or even to add or change some functionalities of the existing add-on packages, taking advantage of being open source.

1.2.1 Starting with R

In order to install R in your system the easiest way is to obtain a binary distribution from the R Web site 4 where you may follow the link that takes you to the CRAN (Comprehensive R Archive Network) site to obtain, among other things, the binary distribution for your particular operating system/architecture. If you prefer to build R directly from the sources you may get instructions on how to do it from CRAN.

After downloading the binary distribution for your operating system you just need to follow the instructions that come with it. In the case of the Windows version, you simply execute the downloaded file (rw1061.exe) 5 and select the options you want in the following menus. In the case of Unix-like operating systems you should contact your system administrator to fulfill the installation task as you will probably not have permissions for that.

5 The actual name of the file may change with newer versions. This is the name for version 1.6.1.

To run R in Windows you simply double click the appropriate icon on your desktop, while in Unix versions you should type R at the operating system prompt. Both will bring up the R console with its prompt “>”.

If you want to quit R you may issue the command q() at the prompt. You will be asked if you want to save the current workspace. You should answer yes only if you want to resume your current analysis at the point you are leaving it, later on.

Although the set of tools that comes with R is by itself quite powerful, it is only natural that you will end up wanting to install some of the large (and growing) set of add-on packages available for R at CRAN. In the Windows version this is easily done through the “Packages” menu. After connecting your computer to the Internet you should select the “Install package from CRAN” option from this menu. This option will present a list of the packages available at CRAN. You select the one(s) you want and R will download the package(s) and self-install it (them) on your system. In Unix versions things are slightly different as R is a console program without any menus. Still the operation is simple. Suppose you want to download the package that provides functions to connect to MySQL databases. This package name is RMySQL 6 . You just need to type the following two commands at the R prompt:

> options(CRAN='http://cran.r-project.org')
> install.packages("RMySQL")

The first instruction sets the option that determines the site from where the packages will be downloaded. Actually, this instruction is not necessary as this is the default value of the CRAN option. Still, you may use such an instruction for selecting the nearest CRAN mirror 7 . The second instruction performs the actual downloading and installation of the package 8 . If you want to know the packages currently installed in your distribution you may issue,

> installed.packages()

This produces a long output with each line containing a package, its version information, the packages it depends on, and so on. Another useful command is the following, which allows you to check whether there are newer versions of your installed packages at CRAN,

> old.packages()

Moreover, you may use the following command to update all your installed packages 9 ,

> update.packages()

R has an integrated help system that you can use to know more about the system and its functionalities. Moreover, you can find extra documentation at the R site (http://www.r-project.org). R comes with a set of HTML files that can be read using a Web browser. On Windows versions of R these pages are accessible through the help menu. Alternatively, you may issue “help.start()” at the prompt to launch the HTML help pages. Another form of getting help is to use the help() function. For instance, if you want some help on the plot() function you can enter the command “help(plot)” (or, alternatively, ?plot). In this case, if the HTML viewer is running the help will be shown there, otherwise it will appear in the R console as text information.
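As a small sketch of the help facilities just described, the lines below can be tried at the prompt. The choice of sqrt and rnorm as lookup targets is only illustrative, and the call that opens a help page is guarded with interactive() so that the same lines can also be run from a script, where no help viewer is available.

```r
# Help pages need a viewer, so only open them in an interactive session.
if (interactive()) {
  help(plot)   # full help page for plot(); ?plot is equivalent
}

# These calls return ordinary R objects and work in any session.
found <- apropos("^sqrt$")            # names of visible objects matching a regular expression
rnorm.args <- names(formals(rnorm))   # argument names of a function, without opening its help page
print(found)
print(rnorm.args)
```

apropos() is handy when you remember only part of a function name, while formals() gives a quick reminder of the arguments a function accepts.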

6 You can get an idea of the functionalities of each of the R packages in the R FAQ (frequently asked questions) at CRAN.
7 The list of available mirrors can be found at http://cran.r-project.org/mirrors.html.
8 Please notice that to carry out these tasks on Unix systems you will most surely need to have root permissions, so the best is to ask your system administrator to do the installation. Still, it is also possible to download and install the packages on your personal home directory (consult the R help facilities to check how).
9 You need root permissions in Linux distributions to do this.


1.2.2 R objects

R is an object-oriented language. All variables, data, functions, etc. are stored in the memory of the computer in the form of named objects. By simply typing the name of an object at the R prompt one can see its contents. For example, if you have a data object named x with the value 945, typing its name at the prompt would show you its value,

> x
[1] 945

The rather cryptic “[1]” in front of the number 945 can be read as “this line is showing values starting from the first element of the object”. This is particularly useful for objects containing several values like vectors, as we will see later. Values may be assigned to objects using the assignment operator. This consists of either an angle bracket followed by a minus sign (<-), or a minus sign followed by an angle bracket (->). Below you may find two alternative and equivalent forms of assigning a value to the object y 10 ,

> y <- 39
> y
[1] 39
> 43 -> y
> y
[1] 43

You may also assign numerical expressions to an object. In this case the object will store the result of the expression,

> z <- 5
> w <- z^2
> w
[1] 25
> i <- (z*2 + 45)/2
> i
[1] 27.5

Whenever you want to assign an expression to an object and then print out the result (as in the previous small examples), you may alternatively surround the assignment statement in parentheses:

> (w <- z^2)
[1] 25

You do not need to assign the result of an expression to an object. In effect, you may use the R prompt as a kind of calculator,

> (34 + 90)/12.5
[1] 9.92


Every object you create will stay in the computer memory until you delete it. You may list the objects currently in the memory by issuing the ls() or objects() commands at the prompt. If you do not need an object you may free some memory space by removing it,

> ls()
[1] "i" "w" "y" "z"
> rm(y)
> rm(z,w,i)
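The effect of rm() can also be checked programmatically. The following is a minimal sketch that uses the base function exists(); the object name tmp.value is arbitrary and chosen only for this illustration.

```r
tmp.value <- 10                       # create an object
was.there <- exists("tmp.value")      # TRUE: the object is in memory
rm(tmp.value)                         # delete it
still.there <- exists("tmp.value")    # FALSE: the name is no longer bound
print(c(was.there, still.there))
```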

Object names may consist of any upper and lower-case letters, the digits 0-9 (except in the beginning of the name), and also the period, “.”, which behaves like a letter but may not appear at the beginning of the name (as digits). Note that names in R are case sensitive, meaning that Color and color are two distinct objects. Important Note: In R you are not allowed to use the underscore character (“_”) in object names. 11

1.2.3 Vectors

The most basic data object in R is a vector. Even when you assign a single number to an object (like in x <- 45.3) you are creating a vector containing a single element. All data objects have a mode and a length. The mode determines the kind of data stored in the object. It can take the values character, logical, numeric or complex. Thus you may have vectors of characters, logical values (T or F or FALSE or TRUE) 12 , numbers, and complex numbers. The length of an object is the number of elements in it, and can be obtained with the function length().
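A quick way to see the four modes mentioned above is to ask R for the mode of one value of each kind. The sketch below uses only the base functions mode() and length(); the particular values are arbitrary.

```r
# One single-element vector of each mode
modes <- c(mode(45.3),    # "numeric"
           mode("rrt"),   # "character"
           mode(TRUE),    # "logical"
           mode(2+3i))    # "complex"
print(modes)

# Even a single number is a vector, so it has a length
print(length(45.3))       # 1
```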

Most of the time you will be using vectors with length larger than 1. You may create a vector in R using the c() function,

> v <- c(4,7,23.5,76.2,80)
> v
[1]  4.0  7.0 23.5 76.2 80.0
> length(v)
[1] 5
> mode(v)
[1] "numeric"

All elements of a vector must belong to the same mode. If that is not true R will force it by type coercion. The following is an example of this,

> v <- c(4,7,23.5,76.2,80,"rrt")
> v
[1] "4"    "7"    "23.5" "76.2" "80"   "rrt"

10 Notice how the assignment is a destructive operation (previously stored values in an object are discarded by new assignments).
11 This is a common cause of frustration for experienced programmers as this is a character commonly used in other languages.
12 Recall that R is case-sensitive, thus, for instance, True is not a valid logical value.


All elements of the vector have been converted to character mode. Character values are strings of characters surrounded by either single or double quotes. All vectors may contain a special value named NA. This represents a missing value,

> v <- c(NA,"rrr")
> v
[1] NA    "rrr"
> u <- c(4,6,NA,2)
> u
[1]  4  6 NA  2
> k <- c(T,F,NA,TRUE)
> k
[1]  TRUE FALSE    NA  TRUE
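Missing values propagate through most computations, so it is useful to know how to detect and skip them. The following sketch reuses the vector u from above and relies only on base functions (is.na(), sum() with its na.rm argument, and as.numeric()); the object names are illustrative. It also shows that coercing a string that is not a number produces NA.

```r
u <- c(4, 6, NA, 2)
is.missing <- is.na(u)                 # FALSE FALSE TRUE FALSE
total <- sum(u)                        # NA: the missing value propagates
total.known <- sum(u, na.rm = TRUE)    # 12: the NA is dropped before summing

# Coercing a non-numeric string yields NA (with a warning, suppressed here)
nums <- suppressWarnings(as.numeric(c("4", "7", "rrt")))

print(is.missing)
print(total.known)
print(nums)
```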

You may access a particular element of a vector through an index,

> v[2]
[1] "rrr"

 

You will learn in Section 1.2.7 that we may use vectors of indexes to obtain more powerful indexing schemes. You may also change the value of one particular vector element,

> v[1] <- 'hello'
> v
[1] "hello" "rrr"

R allows you to create empty vectors like this,

> v <- vector()

 

The length of a vector may be changed by simply adding more elements to it using a previously nonexistent index. For instance, after creating empty vector v you could type,

> v[3] <- 45
> v
[1] NA NA 45

Notice how the first two elements have an unknown value, NA. To shrink the size of a vector you can use the assignment operation. For instance,

> v <- c(45,243,78,343,445,645,2,44,56,77)
> v
 [1]  45 243  78 343 445 645   2  44  56  77
> v <- c(v[5],v[7])
> v
[1] 445   2

Through the use of more powerful indexing schemes to be explored in Section 1.2.7 you will be able to delete particular elements of a vector in an easier way.
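As a brief preview of those indexing schemes, the sketch below obtains the same result as the example above using a vector of indexes, and also shows how negative indexes exclude positions. The object names kept and dropped are illustrative only.

```r
v <- c(45, 243, 78, 343, 445, 645, 2, 44, 56, 77)
kept <- v[c(5, 7)]        # same result as c(v[5], v[7])
dropped <- v[-c(1, 2)]    # negative indexes exclude positions 1 and 2
print(kept)
print(dropped)
```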

1.2.4 Vectorization

One of the most powerful aspects of the R language is the vectorization of several of its available functions. These functions operate directly on each element of a vector. For instance,

> v <- c(4,7,23.5,76.2,80)
> x <- sqrt(v)
> x
[1] 2.000000 2.645751 4.847680 8.729261 8.944272

The function sqrt() calculates the square root of its argument. In this case we have used a vector of numbers as its argument. Vectorization leads the function to produce a vector of the same length, with each element resulting from applying the function to every element of the original vector. You may also use this feature of R to carry out vector arithmetic,

> v1 <- c(4,6,87)
> v2 <- c(34,32.4,12)
> v1+v2
[1] 38.0 38.4 99.0
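To make clear what vectorization saves you from writing, the following sketch computes the same square roots with an explicit loop and with the vectorized call, and then checks that both give the same values. The names x.loop and x.vec are illustrative.

```r
v <- c(4, 7, 23.5, 76.2, 80)

# Element-by-element version, as one would write it in many other languages
x.loop <- numeric(length(v))
for (i in seq_along(v)) {
  x.loop[i] <- sqrt(v[i])
}

# Vectorized version: a single call handles all elements
x.vec <- sqrt(v)

same <- isTRUE(all.equal(x.loop, x.vec))
print(same)
```

The vectorized form is shorter and is usually also faster, since the loop over the elements is carried out internally by R.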

What if the vectors do not have the same length? R will use a recycling rule by repeating the shorter vector till it fills in the size of the larger. For example,

> v1 <- c(4,6,8,24)
> v2 <- c(10,2)
> v1+v2
[1] 14  8 18 26

It is just as if the vector c(10,2) was c(10,2,10,2). If the length of the longer vector is not a multiple of the length of the shorter one, then a warning is issued,

> v1 <- c(4,6,8,24)
> v2 <- c(10,2,4)
> v1+v2
[1] 14  8 12 34
Warning message:
longer object length
        is not a multiple of shorter object length in: v1 + v2

Still, the recycling rule has been used, and the operation is carried out (it is a warning, not an error!). As mentioned before, single numbers are represented in R as vectors of length 1. This is very handy for operations like the one shown below,

> v1 <- c(4,6,8,24)
> 2*v1
[1]  8 12 16 48

Notice how the number 2 (actually the vector c(2)!) was recycled, resulting in multiplying all elements of v1 by 2. As we will see, this recycling rule is also applied with other objects, like arrays and matrices.
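The recycling rule can be made explicit with the function rep(), which is presented in Section 1.2.6. The sketch below builds the recycled version of the shorter vector by hand and confirms that it gives the same sum; the object names are illustrative.

```r
v1 <- c(4, 6, 8, 24)
v2 <- c(10, 2)

# What R does implicitly: repeat v2 until it is as long as v1
v2.full <- rep(v2, length.out = length(v1))   # 10 2 10 2

implicit <- v1 + v2        # relies on the recycling rule
explicit <- v1 + v2.full   # same computation with the recycling written out
print(identical(implicit, explicit))
```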

1.2.5 Factors

Factors provide an easy and compact form of handling categorical (nominal) data. Factors have levels which are the possible values they may take. A factor is stored internally by R as a numeric vector with values 1, 2, ..., k, where k is the number of levels of the factor. Factors are particularly useful in datasets where you have nominal variables with a fixed number of possible values. Several graphical and summarization functions that we will explore in later chapters take advantage of this information.

Let us see how to create factors in R. Suppose you have a vector with the sex of 10 individuals,

> g <- c('f','m','m','m','f','m','f','m','f','f')
> g
 [1] "f" "m" "m" "m" "f" "m" "f" "m" "f" "f"

You can transform this vector into a factor by entering,

> g <- factor(g)
> g
 [1] f m m m f m f m f f
Levels: f m

Notice that you do not have a character vector anymore. Actually, as mentioned above, factors are represented internally as numeric vectors 13 . In this example, we have two levels, 'f' and 'm', which are represented internally as 1 and 2, respectively. Suppose you have 5 extra individuals whose sex information you want to store in another factor object. Suppose that they are all males. If you still want the factor object to have the same two levels as object g, you may issue the following,

> other.g <- factor(c('m','m','m','m','m'),levels=c('f','m'))
> other.g
[1] m m m m m
Levels: f m

Without the levels argument the factor other.g would have a single level ('m'). One of the many things you can do with factors is to count the occurrence of each possible value. Try this,

> table(g)
g
f m
5 5
> table(other.g)
other.g
f m
0 5
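The internal representation of a factor can be inspected directly. The following sketch uses the base functions levels() and as.integer() on the factor g defined above; the names lev, codes and back are illustrative.

```r
g <- factor(c('f','m','m','m','f','m','f','m','f','f'))

lev <- levels(g)          # the possible values: "f" "m"
codes <- as.integer(g)    # the internal codes: 1 for 'f', 2 for 'm'
back <- lev[codes]        # indexing the levels by the codes recovers the strings

print(lev)
print(codes)
print(back)
```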

13 You may confirm it by typing mode(g).


The table() function can also be used to obtain cross-tabulation of several factors. Suppose that we have in another vector the age category of the 10 individuals stored in vector g. You could cross tabulate these two vectors as follows,

> a <- factor(c('adult','adult','juvenile','juvenile','adult','adult',
+               'adult','juvenile','adult','juvenile'))
> table(a,g)
          g
a          f m
  adult    4 2
  juvenile 1 3

Notice how we have introduced a long command in several lines. If you hit the “return” key before ending some command, R presents a continuation prompt (the “+” sign) for you to complete the instruction. Sometimes we wish to calculate the marginal and relative frequencies for this type of tables. The following gives you the totals for both the sex and the age factors of this data,

> t <- table(a,g)
> margin.table(t,1)
a
   adult juvenile
       6        4
> margin.table(t,2)
g
f m
5 5

 

For relative frequencies with respect to each margin and overall we do,

> prop.table(t,1)
          g
a                  f         m
  adult    0.6666667 0.3333333
  juvenile 0.2500000 0.7500000
> prop.table(t,2)
          g
a            f   m
  adult    0.8 0.4
  juvenile 0.2 0.6
> prop.table(t)
          g
a            f   m
  adult    0.4 0.2
  juvenile 0.1 0.3

Notice that if we wanted percentages instead we could simply multiply these function calls by 100.
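As a sketch of that last remark, the lines below rebuild the cross-tabulation and turn the row-wise relative frequencies into percentages rounded to one decimal place. The table is stored here under the illustrative name tab (rather than t), and the rounding with round() is an addition made for readability.

```r
g <- factor(c('f','m','m','m','f','m','f','m','f','f'))
a <- factor(c('adult','adult','juvenile','juvenile','adult','adult',
              'adult','juvenile','adult','juvenile'))

tab <- table(a, g)
pct <- round(100 * prop.table(tab, 1), 1)   # percentages within each age category
print(pct)
```

Each row of pct adds up to 100, because the proportions were computed with respect to the first margin (the age category).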


1.2.6 Generating sequences

R has several facilities to generate different types of sequences. For instance, if you want to create a vector containing the integers between 1 and 1000, you can simply type,

> x <- 1:1000

which creates a vector x containing 1000 elements, the integers from 1 to 1000. You should be careful with the precedence of the operator “:”. The following examples illustrate this danger,

> 10:15-1
[1]  9 10 11 12 13 14
> 10:(15-1)
[1] 10 11 12 13 14

Please make sure you understand what happened in the first command (remember the recycling rule!). You may also generate decreasing sequences like the following,

> 5:0
[1] 5 4 3 2 1 0
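The precedence of “:” matters most when one end of the sequence is itself an expression. The following sketch, with an arbitrary value n, contrasts the two readings; the names a and b are illustrative.

```r
n <- 5
a <- 1:n-1      # evaluated as (1:n) - 1, giving 0 1 2 3 4
b <- 1:(n-1)    # the parentheses give the sequence 1 2 3 4
print(a)
print(b)
```

When in doubt, adding the parentheses costs nothing and makes the intention explicit.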

To generate sequences of real numbers you can use the function seq(). The instruction

> seq(-4,1,0.5)
 [1] -4.0 -3.5 -3.0 -2.5 -2.0 -1.5 -1.0 -0.5  0.0  0.5  1.0

generates a sequence of real numbers between -4 and 1 in increments of 0.5. Here are a few other examples of the use of the function seq() 14 ,

 

>

seq(from=1,to=5,length=4)

 

[1]

1.000000

 

2.333333

3.666667

5.000000

 

>

seq(from=1,to=5,length=2)

 

[1]

1

5

> seq(length=10,from=-2,by=.2)

 
 

[1]

-2.0

 

-1.8

 

-1.6

-1.4

-1.2

-1.0

-0.8

-0.6

-0.4

-0.2
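As a hedged complement to these examples, the sketch below writes the same kind of calls with the full argument name length.out (the shorter length used above works through R's partial matching of argument names) and shows what happens when the step does not land exactly on the end point. The object names are illustrative.

```r
# Spelling out length.out avoids relying on partial argument matching.
s1 <- seq(from = 1, to = 5, length.out = 4)
s2 <- seq(from = -2, by = 0.2, length.out = 10)

# With both end points and a step, the sequence stops at the last
# value that does not go past "to".
s3 <- seq(0, 1, by = 0.3)    # 0.0 0.3 0.6 0.9

s1
s2
s3
```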

Another very useful function to generate sequences is the function rep(),

> rep(5,10)
 [1] 5 5 5 5 5 5 5 5 5 5
> rep('hi',3)
[1] "hi" "hi" "hi"
> rep(1:3,2)
[1] 1 2 3 1 2 3
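The function rep() also accepts the named arguments times and each, which are not used in the examples above. A minimal sketch of the difference, with illustrative object names:

```r
r1 <- rep(1:3, times = 2)               # whole vector repeated: 1 2 3 1 2 3
r2 <- rep(1:3, each = 2)                # each element repeated: 1 1 2 2 3 3
r3 <- rep(c('a', 'b'), times = c(3, 1)) # one count per element: "a" "a" "a" "b"

r1
r2
r3
```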

The function gl() can be used to generate sequences involving factors. The syntax of this function is gl(k,n), where k is the number of levels of the factor, and n the number of repetitions of each level. Here are two examples,


14 You may want to have a look at the help page of the function (typing for instance ’?seq’), to better understand its arguments and variants.

> gl(3,5)
 [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
Levels: 1 2 3
> gl(2,5,labels=c('female','male'))
 [1] female female female female female male   male   male   male   male
Levels: female male
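A short sketch of how the two arguments of gl() interact. It also uses the optional third argument, length, which is not discussed in the text: it sets the total length of the result, and the pattern of levels is repeated until that length is reached. The object names are illustrative.

```r
# 5 consecutive repetitions of each of the 2 levels.
f1 <- gl(2, 5, labels = c('female', 'male'))

# 1 repetition of each level, with the pattern repeated up to length 10,
# which gives alternating levels.
f2 <- gl(2, 1, length = 10, labels = c('female', 'male'))

table(f1)          # 5 observations in each level
as.character(f2)   # "female" "male" "female" "male" ...
```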

Finally, R has several functions that can be used to generate random sequences according to a large set of probability density functions. The functions have the generic structure rfunc(n, par1, par2, ...), where func is the name of the density function, n is the number of data to generate, and par1, par2, ... are the values of some parameters of the density function that may be necessary. For instance, if you want 10 randomly generated numbers from a normal distribution with zero mean and unit standard deviation, type

> rnorm(10)
 [1] -0.306202028  0.335295844  1.199523068  2.034668704  0.273439339
 [6] -0.001529852  1.351941008  1.643033230 -0.927847816 -0.163297158

while if you prefer a mean of 10 and a standard deviation of 3, you should use

> rnorm(10,mean=10,sd=3)
 [1]  7.491544 12.360160 12.879259  5.307659 11.103252 18.431678  9.554603
 [8]  9.590276  7.133595  5.498858

To get 5 numbers drawn randomly from a Student t distribution with 10 degrees of freedom, type

> rt(5,df=10)
[1] -0.46608438 -0.44270650 -0.03921861  0.18618004  2.23085412
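Because these numbers are random, running the commands above will give different values each time. The sketch below is an addition to the text: it uses set.seed() to make the draws repeatable, and shows two other generators, runif() and rbinom(), that follow the same rfunc(n, parameters) pattern. The seed value 1234 is arbitrary.

```r
set.seed(1234)                          # fix the state of the generator
n1 <- rnorm(10, mean = 10, sd = 3)
set.seed(1234)                          # same seed again ...
n2 <- rnorm(10, mean = 10, sd = 3)
identical(n1, n2)                       # ... reproduces exactly the same draws

u <- runif(5, min = 0, max = 1)         # 5 uniform numbers between 0 and 1
b <- rbinom(5, size = 10, prob = 0.5)   # 5 binomial counts out of 10 trials
```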

R has many more probability functions, as well as other functions for obtaining the probability densities, the cumulative probability densities and the quantiles of these distributions.
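As an illustration of this naming scheme, and only as a sketch for the normal and Student t distributions, the prefix d gives the density, p the cumulative probability and q the quantile. The object names are illustrative.

```r
d0 <- dnorm(0)              # density of the standard normal at 0: 1/sqrt(2*pi)
p  <- pnorm(1.96)           # P(X <= 1.96) for a standard normal, about 0.975
q  <- qnorm(0.975)          # value with 97.5% of the mass below it, about 1.96
qt10 <- qt(0.975, df = 10)  # same quantile for a Student t with 10 degrees
                            # of freedom; larger, as the t has heavier tails
c(d0, p, q, qt10)
```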

1.2.7 Indexing

We have already seen examples of how to get one element of a vector by indicating its position between square brackets. R also allows you to use vectors within the brackets. There are several types of index vectors. Logical index vectors extract the elements corresponding to true values. Let us see a concrete example.

> x <- c(0,-3,4,-1,45,90,-5)
> x
[1]  0 -3  4 -1 45 90 -5
> x > 0
[1] FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE
> y <- x>0

The third instruction of the code shown above is a logical condition. As x is a vector, the comparison is carried out for all elements of the vector (remember the famous recycling rule!), thus producing a vector with as many logical values as there are elements in x. We then store this logical vector in object y. You can now obtain the positive elements in x using this vector y as a logical index vector,

> x[y]
[1]  4 45 90

As the true elements of vector y are in the 3rd, 5th and 6th positions, this corresponds to extracting these elements from x. Incidentally, you could achieve the same result by just issuing,

> x[x>0]
[1]  4 45 90
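A small sketch of what can be done with such a logical vector besides indexing. The functions which() and sum() are not used in the text; they are standard R functions added here for illustration.

```r
x <- c(0, -3, 4, -1, 45, 90, -5)
y <- x > 0      # one logical value per element of x

which(y)        # positions of the TRUE values: 3 5 6
sum(y)          # TRUE counts as 1, so this is the number of positives: 3
x[y]            # the positive elements: 4 45 90
x[!y]           # "!" negates the index, giving the rest: 0 -3 -1 -5
```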

Taking advantage of the logical operators available in R you may use more complex logical index vectors, as for instance,

> x[x <= -2 | x > 5]
[1] -3 45 90 -5
> x[x > 40 & x < 100]
[1] 45 90
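To see how the two index expressions above are built, the following sketch computes the intermediate logical vectors one step at a time. The names low, high, either and both are illustrative.

```r
x <- c(0, -3, 4, -1, 45, 90, -5)

low  <- x <= -2           # FALSE TRUE FALSE FALSE FALSE FALSE TRUE
high <- x > 5             # FALSE FALSE FALSE FALSE TRUE TRUE FALSE

either <- low | high      # TRUE where at least one condition holds
both   <- x > 40 & x < 100  # TRUE only where both conditions hold

x[either]                 # -3 45 90 -5
x[both]                   # 45 90
```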

As you may have guessed, the “|” operator performs logical disjunction, while the “&” operator is used for logical conjunction. This means that the first instruction shows us the elements of x that are either less than or equal to -2, or greater than 5. The second example presents the elements of x that are both greater than 40 and less than 100.

R also allows you to use a vector of integers to extract elements from a vector. The numbers in this vector indicate the positions in the original vector to be extracted,

> x[c(4,6)]
[1] -1 90
> x[1:3]
[1]  0 -3  4
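A few further properties of integer index vectors, shown as a sketch that goes slightly beyond the examples above: the order of the positions is respected, positions may be repeated, and a position past the end of the vector yields NA.

```r
x <- c(0, -3, 4, -1, 45, 90, -5)

x[c(6, 4)]      # elements come out in the order requested: 90 -1
x[c(1, 1, 2)]   # a position may appear more than once: 0 0 -3
x[length(x)]    # the last element: -5
x[10]           # there is no 10th element, so the result is NA
```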

Alternatively, you may use a vector with negative indexes, to indicate which elements are to be excluded from the selection,

> x[-1]
[1] -3  4 -1 45 90 -5
> x[-c(4,6)]
[1]  0 -3  4 45 -5
> x[-(1:3)]
[1] -1 45 90 -5
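The sketch below compares the two ways of writing the last index. It assumes the same vector x, and uses try() (an addition to the text) so that the error raised by the second form does not stop the script.

```r
x <- c(0, -3, 4, -1, 45, 90, -5)

-(1:3)          # -1 -2 -3 : the first three positions, negated
-1:3            # -1 0 1 2 3 : the minus sign is applied before ":"

x[-(1:3)]       # -1 45 90 -5

# Negative and positive positions cannot be mixed in one index, so
# x[-1:3] fails with an error instead of excluding anything.
res <- try(x[-1:3], silent = TRUE)
inherits(res, "try-error")
```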

Note the need for parentheses in the last example due to the precedence of the “:” operator.

Indexes may also be formed by a vector of character strings, taking advantage of the fact that R allows you to name the elements of a vector through the function names(). Named elements are sometimes preferable because their positions are easier to memorize. For instance, imagine you have a vector of measurements of a chemical parameter obtained at 5 different places. You could create a named vector as follows,
