Data Mining with R:
learning by case studies
Luis Torgo
LIACC-FEP, University of Porto
R. Campo Alegre, 823 - 4150 Porto, Portugal
email: ltorgo@liacc.up.pt
http://www.liacc.up.pt/~ltorgo
May 22, 2003
Preface
The main goal of this book is to introduce the reader to the use of R as a tool for performing data mining. R is a freely downloadable ^{1} language and environment for statistical computing and graphics. Its capabilities and the large set of available packages make this tool an excellent alternative to the existing (and expensive!) data mining tools.
One of the key issues in data mining is size. A typical data mining problem involves a large database from which one seeks to extract useful knowledge. In this book we will use MySQL as the core database management system. MySQL is also freely available ^{2} for several computer platforms. This means that you will be able to perform “serious” data mining without having to pay any money at all. Moreover, we hope to show you that this comes with no compromise in the quality of the obtained solutions. Expensive tools do not necessarily mean better tools! R together with MySQL form a pair that is very hard to beat, as long as you are willing to spend some time learning how to use them. We think that it is worthwhile, and we hope that you are convinced as well by the end of this book.
The goal of this book is not to describe all facets of data mining processes. Many books exist that cover this area. Instead we propose to introduce the reader to the power of R and data mining by means of several case studies. Obviously, these case studies do not represent all possible data mining problems that one can face in the real world. Moreover, the solutions we describe cannot be taken as complete solutions. Our goal is rather to introduce the reader to the world of data mining using R through practical examples. As such, our analysis of the case studies has the goal of showing examples of knowledge extraction using R, instead of presenting complete reports of data mining case studies. They should be taken as examples of possible paths in any data mining project and can be used as the basis for developing solutions for the reader’s data mining projects. Still, we have tried to cover a diverse set of problems posing different challenges in terms of size, type of data, goals of analysis and tools that are necessary to carry out this analysis.
We do not assume any prior knowledge about R. Readers who are new to R and data mining should be able to follow the case studies. We have tried to make the different case studies self-contained in such a way that the reader can start anywhere in the document. Still, some basic R functionalities are introduced in the first, simpler, case studies, and are not repeated, which means that if you are new to R, then you should at least start with the first case studies to get acquainted with R. Moreover, the first chapter provides a very short introduction to R basics, which may facilitate the understanding of the following chapters.
We also do not assume any familiarity with data mining or statistical techniques. Brief introductions to different modeling approaches are provided as they are necessary in the case studies. It is not an objective of this book to provide the reader with full information on the technical and theoretical details of these techniques. Our descriptions of these models are given to provide a basic understanding of their merits, drawbacks and analysis objectives. Other existing books should be considered if further theoretical insights are required. At the end of some sections we provide “Further readings” pointers for the readers interested in knowing more on the topics.
In summary, our target readers are more users of data analysis tools than researchers or developers. Still, we hope the latter also find reading this book useful as a way of entering the “world” of R and data mining.
The book is accompanied by a set of freely available R source files that can be obtained at the book Web site ^{3} . These files include all the code used in the case studies. They facilitate the “do it yourself” philosophy followed in this document. We strongly recommend that readers install R and try the code as they read the book. All data used in the case studies is available at the book Web site as well.
^{1} Download it from http://www.R-project.org.
^{2} Download it from http://www.mysql.com.
(DRAFT  May 22, 2003)
Contents
Preface
1 Introduction
1.1 How to read this book?
1.2 A short introduction to R
1.2.1 Starting with R
1.2.2 R objects
1.2.3 Vectors
1.2.4 Vectorization
1.2.5 Factors
1.2.6 Generating sequences
1.2.7 Indexing
1.2.8 Matrices and arrays
1.2.9 Lists
1.2.10 Data frames
1.2.11 Some useful functions
1.2.12 Creating new functions
1.2.13 Managing your sessions
1.3 A short introduction to MySQL
2 Predicting Algae Blooms
2.1 Problem description and objectives
2.2 Data Description
2.3 Loading the data into R
2.4 Data Visualization and Summarization
2.5 Unknown values
2.5.1 Removing the observations with unknown values
2.5.2 Filling in the unknowns with the most frequent values
2.5.3 Filling in the unknown values by exploring correlations
2.5.4 Filling in the unknown values by exploring similarities between cases
2.6 Obtaining prediction models
2.6.1 Multiple linear regression
2.6.2 Regression trees
2.7 Model evaluation and selection
2.8 Predictions for the 7 algae
2.8.1 Preparing the test data
2.8.2 Comparing the alternative models
2.8.3 Obtaining the prediction for the test samples
2.9 Summary
3 Predicting Stock Market Returns
3.1 Problem description and objectives
3.2 The available data
3.2.1 Reading the data from the CSV file
3.2.2 Reading the data from a MySQL database
3.2.3 Getting the data from the Web
3.3 Time series predictions
3.3.1 Obtaining time series prediction models
3.3.2 Evaluating time series models
3.3.3 Model selection
3.4 From predictions into trading actions
3.4.1 Evaluating trading signals
3.4.2 A simulated trader
3.5 Going back to data selection
3.5.1 Enriching the set of predictor variables
Bibliography
Chapter 1
Introduction
R is a language and an environment for statistical computing. It is similar to the S language developed at AT&T Bell Laboratories by Rick Becker, John Chambers and Allan Wilks. There are versions of R for the Unix, Windows and Mac families of operating systems. Moreover, R runs on different computer architectures like Intel, PowerPC, Alpha systems and Sparc systems.
R was initially developed by Ihaka and Gentleman (1996), both from the University of Auckland, New Zealand. The current development of R is carried out by a core team of a dozen people from different institutions around the world. R development takes advantage of a growing community that cooperates in its development due to its open source philosophy. In effect, the source code of every R component is freely available for inspection and/or adaptation. There are many critics of the open source model. Most of them mention the lack of support as the main drawback of open source software. This is certainly not the case with R! There are many excellent documents, books and sites that provide free information on R. Moreover, the excellent R-help mailing list is a source of invaluable advice and information, much better than any amount of money could ever buy! There are also searchable mailing list archives ^{1} that you can use before posting a question.
Data Mining has to do with the discovery of useful, valid, unexpected and understandable knowledge from data. These general objectives are obviously shared by other disciplines like statistics, machine learning or pattern recognition. One of the most important distinguishing issues in data mining is size. With the advent of computer technology and information systems, the amount of data available for exploration has increased exponentially. This poses difficult challenges to the standard data analysis disciplines: one has to consider issues like computational efficiency, limited memory resources, interfaces to databases, etc.
All these issues turn data mining into a highly interdisciplinary subject involving tasks not only of typical data analysts but also of people working with databases, data visualization in high dimensions, etc.
R has limitations with handling enormous datasets because all computation is carried out in the main memory of the computer. This does not mean that we will not be able to handle these problems. Taking advantage of the highly flexible database interfaces available in R, we will be able to perform data mining on large problems. Being faithful to the Open Source philosophy we will use the excellent MySQL database management system ^{2} . MySQL is also available for a quite large set of computer platforms and operating systems. Moreover, R has a package that enables an easy interface to MySQL (package “RMySQL”).
In summary, we hope that at the end of reading this book you are convinced that you can do data mining on large problems without having to spend any money at all! That is only possible due to the generous and invaluable contribution of lots of scientists who build such wonderful tools as R and MySQL.
1.1 How to read this book?
The main spirit behind the book is:
Learn by doing it!
The book is organized as a set of case studies. The “solutions” to these case studies are obtained using R. All necessary steps to reach the solutions are described. Using the book Web site ^{3} you may get all code included in the document, as well as all data of the case studies. This should facilitate trying them out by yourself. Ideally, you should read this document beside your computer and try every step as it is presented to you in the document. R code is shown in the book using the following font,
> R.version
platform i686-pc-linux-gnu
arch     i686
os       linux-gnu
system   i686, linux-gnu
status
major    1
minor    7.0
year     2003
month    04
day      16
language R
R commands are entered at the R command prompt, “>”. Whenever you see this prompt you may interpret it as R waiting for you to enter a command. You type in the commands at the prompt and then press the enter key to ask R to execute them. This usually produces some form of output (the result of the command) and then a new prompt appears. At the prompt you may use the up arrow key to browse through previously entered commands. This is handy when you want to type commands similar to what you have done before, as you avoid typing them again. Still, you may take advantage of the code provided at the book Web site to cut and paste between your browser and the R console, thus avoiding having to type all commands described in the book. This will surely facilitate your learning of R and improve your understanding of its potential.
^{2} Free download at http://www.mysql.com. ^{3} http://www.liacc.up.pt/~ltorgo/DataMiningWithR/.
1.2 A short introduction to R
The goal of this section is to provide a brief introduction to the key issues of the R language. We do not assume any familiarity with computer programming. Every reader should be able to follow the examples presented in this section. Still, if you feel some lack of motivation to continue reading this introductory material, do not worry. You may proceed to the case studies and then return to this introduction as you get more motivated by the concrete applications.
R is a functional language for statistical computation and graphics. It can be seen as a dialect of the S language (developed at AT&T) for which John Chambers was awarded the 1998 Association for Computing Machinery (ACM) Software award, which mentioned that this language “forever altered how people analyze, visualize and manipulate data”.
R can be quite useful just by using it in an interactive fashion. Still, more advanced uses of the system will lead users to develop their own functions to systematize repetitive tasks, or even to add or change some functionalities of the existing add-on packages, taking advantage of R being open source.
1.2.1 Starting with R
In order to install R in your system the easiest way is to obtain a binary distribution from the R Web site ^{4} where you may follow the link that takes you to the CRAN (Comprehensive R Archive Network) site to obtain, among other things, the binary distribution for your particular operating system/architecture. If you prefer to build R directly from the sources you may get instructions on how to do it from CRAN.
After downloading the binary distribution for your operating system you just need to follow the instructions that come with it. In the case of the Windows version, you simply execute the downloaded file (rw1061.exe) ^{5} and select the options you want in the following menus. In the case of Unix-like operating systems you should contact your system administrator to carry out the installation, as you will probably not have permissions for that.
To run R in Windows you simply double click the appropriate icon on your desktop, while in Unix versions you should type R at the operating system prompt. Both will bring up the R console with its prompt “>”.
If you want to quit R you may issue the command q() at the prompt. You will be asked if you want to save the current workspace. You should answer yes only if you want to resume your current analysis later on, at the point you are leaving it.
Although the set of tools that comes with R is by itself quite powerful, it is only natural that you will end up wanting to install some of the large (and growing) set of add-on packages available for R at CRAN. In the Windows version this is easily done through the “Packages” menu. After connecting your computer to the Internet you should select the “Install package from CRAN” option from this menu. This option will present a list of the packages available at CRAN. You select the one(s) you want and R will download the package(s) and self-install it (them) on your system.
^{5} This is the file name for version 1.6.1.
In Unix versions things are slightly different as R is a console program without any menus. Still, the operation is simple. Suppose you want to download the package that provides functions to connect to MySQL databases. This package is named RMySQL ^{6} . You just need to type the following two commands at the R prompt:
> options(CRAN='http://cran.r-project.org')
> install.packages("RMySQL")
The first instruction sets the option that determines the site from which the packages will be downloaded. Actually, this instruction is not necessary as this is the default value of the CRAN option. Still, you may use such an instruction to select the nearest CRAN mirror ^{7} . The second instruction performs the actual downloading and installation of the package ^{8} . If you want to know which packages are currently installed in your distribution you may issue,
> installed.packages()
This produces a long output with each line containing a package, its version information, the packages it depends on, and so on. Another useful command is the following, which allows you to check whether there are newer versions of your installed packages at CRAN,
> old.packages()
Moreover, you may use the following command to update all your installed packages ^{9} ,
> update.packages()
R has an integrated help system that you can use to know more about the system and its functionalities. Moreover, you can find extra documentation at the R site (http://www.r-project.org). R comes with a set of HTML files that can be read using a Web browser. On Windows versions of R these pages are accessible through the help menu. Alternatively, you may issue “help.start()” at the prompt to launch the HTML help pages. Another way of getting help is to use the help() function. For instance, if you want some help on the plot() function you can enter the command “help(plot)” (or, alternatively, ?plot). In this case, if the HTML viewer is running the help will be shown there, otherwise it will appear in the R console as text information.
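Two housekeeping tasks from this section, checking whether a package is already installed and looking up functions from the console, can be sketched as follows. The helpers is.installed() and install.if.missing() are hypothetical names introduced only for this illustration, and the mirror URL is an assumption. In more recent R versions the mirror is usually given through the repos argument of install.packages() rather than through the CRAN option.

```r
# Hypothetical helper (not part of R): is a package already installed?
is.installed <- function(pkg) {
  pkg %in% rownames(installed.packages())
}

# Hypothetical helper: install a package only when it is missing.
# The mirror URL is an assumption; substitute the one nearest to you.
install.if.missing <- function(pkg, repos = "https://cloud.r-project.org") {
  if (!is.installed(pkg)) install.packages(pkg, repos = repos)
  invisible(is.installed(pkg))
}

is.installed("base")       # the 'base' package always ships with R

# Looking things up without leaving the console:
apropos("^install")        # objects whose names start with "install"
args(install.packages)     # argument list of a function
```

Calling install.if.missing("RMySQL") would then download the package only on a machine where it is not yet available.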
^{6} You can get an idea of the functionalities of each of the R packages in the R FAQ (frequently asked questions) at CRAN.
^{7} The list of available mirrors can be found at http://cran.r-project.org/mirrors.html.
^{8} Please notice that to carry out these tasks on Unix systems you will most surely need to have root permissions, so it is best to ask your system administrator to do the installation. Still, it is also possible to download and install the packages in your personal home directory (consult the R help facilities to check how).
^{9} You need root permissions in Linux distributions to do this.
1.2.2 R objects
R is an object-oriented language. All variables, data, functions, etc. are stored in the memory of the computer in the form of named objects. By simply typing the name of an object at the R prompt one can see its contents. For example, if you have a data object named x with the value 945, typing its name at the prompt would show you its value,
> x
[1] 945
The rather cryptic “[1]” in front of the number 945 can be read as “this line is showing values starting from the first element of the object”. This is particularly useful for objects containing several values like vectors, as we will see later.
Values may be assigned to objects using the assignment operator. This consists of either an angle bracket followed by a minus sign (<-), or a minus sign followed by an angle bracket (->). Below you may find two alternative and equivalent forms of assigning a value to the object y ^{10} ,
> y <- 39
> y
[1] 39
> 43 -> y
> y
[1] 43
You may also assign numerical expressions to an object. In this case the object will store the result of the expression,
> z <- 5
> w <- z^2
> w
[1] 25
> i <- (z*2 + 45)/2
> i
[1] 27.5
Whenever you want to assign an expression to an object and then print out the result (as in the previous small examples), you may alternatively surround the assignment statement in parentheses:
> (w <- z^2)
[1] 25
You do not need to assign the result of an expression to an object. In effect, you may use the R prompt as a kind of calculator,
> (34 + 90)/12.5
[1] 9.92
Every object you create will stay in the computer memory until you delete it. You may list the objects currently in the memory by issuing the ls() or objects() commands at the prompt. If you do not need an object you may free some memory space by removing it,
> ls()
[1] "i" "w" "y" "z"
> rm(y)
> rm(z,w,i)
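As a small illustration of the life cycle of an object, the sketch below creates an object, checks that it exists, removes it, and checks again. The name tmp.value is made up for this example; exists() is a standard R function that reports whether an object with a given name can be found.

```r
# 'tmp.value' is a hypothetical object name used only for this illustration.
tmp.value <- 10
before <- exists("tmp.value")   # TRUE: the object is in memory
rm(tmp.value)
after <- exists("tmp.value")    # FALSE: the object has been removed
c(before, after)
```

Note that rm(list = ls()) removes every object in the workspace at once, so it should be used with care.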
Object names may consist of any upper and lower-case letters, the digits 0-9 (except at the beginning of the name), and also the period, “.”, which behaves like a letter (a name may even start with a period, provided the next character is not a digit, although such objects are not listed by ls() by default). Note that names in R are case sensitive, meaning that Color and color are two distinct objects.
Important Note: In R you are not allowed to use the underscore character (“_”) in object names. ^{11} This restriction applies to the R version used in this book; since R 1.9.0 the underscore is allowed in names.
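The naming rules can be checked at the prompt. The short sketch below shows that names differing only in case are distinct objects, and uses the standard function make.names() to see how R would turn arbitrary strings into valid names. The strings themselves are invented for the example.

```r
# Names are case sensitive: these are two different objects.
Color <- "red"
color <- "blue"
same <- identical(Color, color)     # FALSE

# make.names() converts arbitrary strings into syntactically valid names:
# a leading digit gets an "X" prefix and a space becomes a period.
fixed <- make.names(c("2nd.try", "my value", "ok.name"))
fixed
```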
1.2.3 Vectors
The most basic data object in R is a vector. Even when you assign a single number to an object (like in x <- 45.3) you are creating a vector containing a single element. All data objects have a mode and a length. The mode determines the kind of data stored in the object. It can take the values character, logical, numeric or complex. Thus you may have vectors of characters, logical values (T or F or FALSE or TRUE) ^{12} , numbers, and complex numbers. The length of an object is the number of elements in it, and can be obtained with the function length().
Most of the time you will be using vectors with length larger than 1. You may create a vector in R using the c() function,
> v <- c(4,7,23.5,76.2,80)
> v
[1]  4.0  7.0 23.5 76.2 80.0
> length(v)
[1] 5
> mode(v)
[1] "numeric"
All elements of a vector must belong to the same mode. If that is not true R will force it by type coercion. The following is an example of this,
> v <- c(4,7,23.5,76.2,80,"rrt")
> v
[1] "4"    "7"    "23.5" "76.2" "80"   "rrt"
^{10} Notice how the assignment is a destructive operation (previous stored values in an object are discarded by new assignments).
^{11} This is a common cause of frustration for experienced programmers as this is a character commonly used in other languages.
^{12} Recall that R is case-sensitive, thus, for instance, True is not a valid logical value.
All elements of the vector have been converted to character mode. Character values are strings of characters surrounded by either single or double quotes.
All vectors may contain a special value named NA. This represents a missing value,
> v <- c(NA,"rrr")
> v
[1] NA    "rrr"
> u <- c(4,6,NA,2)
> u
[1]  4  6 NA  2
> k <- c(T,F,NA,TRUE)
> k
[1]  TRUE FALSE    NA  TRUE
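Coercion and missing values, both introduced above, can be explored a little further with standard functions. The sketch below is only an illustration: it shows the direction in which coercion goes, how to convert back explicitly with as.numeric(), and how is.na() and the na.rm argument help when a vector contains unknown values.

```r
# Coercion follows the order logical -> numeric -> character.
mode(c(TRUE, 2))               # "numeric": TRUE is stored as 1
mode(c(2, "a"))                # "character"

# Going the other way must be requested explicitly.
x <- c("4", "7", "23.5")
doubled <- as.numeric(x) * 2   # 8 14 47

# Missing values: detect them, count them, and skip them in a summary.
u <- c(4, 6, NA, 2)
is.na(u)                       # FALSE FALSE TRUE FALSE
sum(is.na(u))                  # 1
mean(u)                        # NA, because one element is unknown
mean(u, na.rm = TRUE)          # 4
```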

You may access a particular element of a vector through an index,
> v[2]
[1] "rrr"
You will learn in Section 1.2.7 that we may use vectors of indexes to obtain more powerful indexing schemes. You may also change the value of one particular vector element,
> v[1] <- 'hello'
> v
[1] "hello" "rrr"

R allows you to create empty vectors like this,
> v <- vector()

The length of a vector may be changed by simply adding more elements to it using a previously nonexistent index. For instance, after creating the empty vector v you could type,
> v[3] <- 45
> v
[1] NA NA 45

Notice how the first two elements have an unknown value, NA. To shrink the size of a vector you can use the assignment operation. For instance,
> v <- c(45,243,78,343,445,645,2,44,56,77)
> v
 [1]  45 243  78 343 445 645   2  44  56  77
> v <- c(v[5],v[7])
> v
[1] 445   2
Through the use of more powerful indexing schemes, to be explored in Section 1.2.7, you will be able to delete particular elements of a vector in an easier way.
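Besides the approaches shown above, the length of a vector can also be set directly. The following sketch, which is an addition and not part of the original examples, uses the standard replacement form of length(): shortening keeps the first elements, and lengthening pads the vector with NA.

```r
v <- c(45, 243, 78, 343)

length(v) <- 2      # keep only the first two elements
short <- v          # 45 243

length(v) <- 4      # extend again; the new positions are filled with NA
padded <- v         # 45 243 NA NA
padded
```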
1.2.4 Vectorization
One of the most powerful aspects of the R language is the vectorization of several of its available functions. These functions operate directly on each element of a vector. For instance,
> v <- c(4,7,23.5,76.2,80)
> x <- sqrt(v)
> x
[1] 2.000000 2.645751 4.847680 8.729261 8.944272
The function sqrt() calculates the square root of its argument. In this case we have used a vector of numbers as its argument. Vectorization leads the function to produce a vector of the same length, with each element resulting from applying the function to the corresponding element of the original vector.
You may also use this feature of R to carry out vector arithmetic,
> v1 <- c(4,6,87)
> v2 <- c(34,32.4,12)
> v1+v2
[1] 38.0 38.4 99.0
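To see what vectorization saves you from writing, the sketch below computes the same square roots twice: once with an explicit loop over the elements and once with a single vectorized call. The object names roots.loop and roots.vec are chosen only for this illustration.

```r
v <- c(4, 7, 23.5, 76.2, 80)

# Element-by-element version: reserve the result, then fill it in a loop.
roots.loop <- numeric(length(v))
for (i in 1:length(v)) {
  roots.loop[i] <- sqrt(v[i])
}

# Vectorized version: one call handles every element.
roots.vec <- sqrt(v)

all.equal(roots.loop, roots.vec)   # both approaches give the same result
```

The vectorized form is shorter and is usually also the faster of the two.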

What if the vectors do not have the same length? R will use a recycling rule by repeating the shorter vector until it fills in the size of the larger. For example,
> v1 <- c(4,6,8,24)
> v2 <- c(10,2)
> v1+v2
[1] 14  8 18 26
It is just as if the vector c(10,2) was c(10,2,10,2). If the length of the longer vector is not a multiple of the length of the shorter one, then a warning is issued,
> v1 <- c(4,6,8,24)
> v2 <- c(10,2,4)
> v1+v2
[1] 14  8 12 34
Warning message:
longer object length
        is not a multiple of shorter object length in: v1 + v2
Still, the recycling rule has been used, and the operation is carried out (it is a warning, not an error!). As mentioned before, single numbers are represented in R as vectors of length 1. This is very handy for operations like the one shown below,
> v1 <- c(4,6,8,24)
> 2*v1
[1]  8 12 16 48
Notice how the number 2 (actually the vector c(2)!) was recycled, resulting in multiplying all elements of v1 by 2. As we will see, this recycling rule is also applied with other objects, like arrays and matrices.
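A practical consequence of the recycling rule is that whole formulas can be applied to a vector in one line. As an illustration with made-up temperature values, the sketch below converts Celsius to Fahrenheit and then centers the values around their mean. In both cases single numbers are recycled across the whole vector.

```r
# Hypothetical temperature readings in degrees Celsius.
celsius <- c(-5, 0, 12.5, 30)

# The constants 9/5 and 32 are recycled for every element.
fahrenheit <- celsius * 9/5 + 32      # 23.0 32.0 54.5 86.0

# mean(celsius) is a vector of length 1, so it is recycled as well.
centered <- celsius - mean(celsius)
sum(centered)                         # 0
```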
1.2.5 Factors
Factors provide an easy and compact form of handling categorical (nominal) data. Factors have levels which are the possible values they may take. A factor is stored internally by R as a numeric vector with values 1, 2, ..., k, where k is the number of levels of the factor. Factors are particularly useful in datasets where you have nominal variables with a fixed number of possible values. Several graphical and summarization functions that we will explore in later chapters take advantage of this information.
Let us see how to create factors in R. Suppose you have a vector with the sex of 10 individuals,
> g <- c('f','m','m','m','f','m','f','m','f','f')
> g
 [1] "f" "m" "m" "m" "f" "m" "f" "m" "f" "f"
You can transform this vector into a factor by entering,
> g <- factor(g)
> g
 [1] f m m m f m f m f f
Levels: f m
Notice that you do not have a character vector anymore. Actually, as mentioned above, factors are represented internally as numeric vectors ^{13} . In this example, we have two levels, 'f' and 'm', which are represented internally as 1 and 2, respectively. Suppose you have 5 extra individuals whose sex information you want to store in another factor object. Suppose that they are all males. If you still want the factor object to have the same two levels as object g, you may issue the following,
> other.g <- factor(c('m','m','m','m','m'),levels=c('f','m'))
> other.g
[1] m m m m m
Levels: f m
Without the levels argument the factor other.g would have a single level ('m'). One of the many things you can do with factors is to count the occurrence of each possible value. Try this,
> table(g)
g
f m
5 5
> table(other.g)
other.g
f m
0 5
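The internal representation of a factor can be inspected directly. The sketch below rebuilds the factor g from the text and uses the standard functions levels(), nlevels() and as.integer() to show the level labels, their number, and the integer codes stored for each individual.

```r
g <- factor(c("f", "m", "m", "m", "f", "m", "f", "m", "f", "f"))

levels(g)                      # "f" "m"
nlevels(g)                     # 2

codes <- as.integer(g)         # internal codes: 1 2 2 2 1 2 1 2 1 1
codes

# Indexing the levels by the codes recovers the original labels.
labels.back <- levels(g)[codes]
```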
^{13} You may confirm it by typing mode(g).
The table() function can also be used to obtain cross-tabulations of several factors. Suppose that we have in another vector the age category of the 10 individuals stored in vector g. You could cross-tabulate these two vectors as follows,
> a <- factor(c('adult','adult','juvenile','juvenile','adult','adult',
+               'adult','juvenile','adult','juvenile'))
> table(a,g)
          g
a          f m
  adult    4 2
  juvenile 1 3
Notice how we have introduced a long command in several lines. If you hit the “return” key before ending some command, R presents a continuation prompt (the “+” sign) for you to complete the instruction.
Sometimes we wish to calculate the marginal and relative frequencies for this type of table. The following gives you the totals for both the sex and the age factors of this data,
> t <- table(a,g)
> margin.table(t,1)
a
   adult juvenile
       6        4
> margin.table(t,2)
g
f m
5 5

For relative frequencies with respect to each margin and overall we do,
> prop.table(t,1)
          g
a                  f         m
  adult    0.6666667 0.3333333
  juvenile 0.2500000 0.7500000
> prop.table(t,2)
          g
a            f   m
  adult    0.8 0.4
  juvenile 0.2 0.6
> prop.table(t)
          g
a            f   m
  adult    0.4 0.2
  juvenile 0.1 0.3
Notice that if we wanted percentages instead we could simply multiply these function calls by 100.
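As a concrete version of the remark about percentages, the sketch below rebuilds the two factors, multiplies the relative frequencies by 100 and rounds the result with round(). The table is stored under the name tab here (an arbitrary choice for this illustration) to avoid hiding R's own t() function.

```r
g <- factor(c("f", "m", "m", "m", "f", "m", "f", "m", "f", "f"))
a <- factor(c("adult", "adult", "juvenile", "juvenile", "adult", "adult",
              "adult", "juvenile", "adult", "juvenile"))
tab <- table(a, g)

# Row percentages, rounded to one decimal place.
row.pct <- round(100 * prop.table(tab, 1), 1)
row.pct

# Overall percentages: the four cells add up to 100.
all.pct <- 100 * prop.table(tab)
sum(all.pct)
```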
1.2.6 Generating sequences
R has several facilities to generate different types of sequences. For instance, if you want to create a vector containing the integers between 1 and 1000, you can simply type,
> x <- 1:1000
which creates a vector x containing 1000 elements, the integers from 1 to 1000. You should be careful with the precedence of the operator “:”. The following examples illustrate this danger,
> 10:15-1
[1]  9 10 11 12 13 14
> 10:(15-1)
[1] 10 11 12 13 14
Please make sure you understand what happened in the first command (remember the recycling rule!). You may also generate decreasing sequences like the following,
> 5:0
[1] 5 4 3 2 1 0
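The precedence issue mentioned above appears most often when the end of a sequence is held in a variable. In the sketch below, where n is just an example value, the operator “:” is evaluated before the subtraction unless parentheses are used, while a leading minus sign is applied before “:”.

```r
n <- 5

shifted  <- 1:n-1      # (1:n) - 1, that is 0 1 2 3 4
intended <- 1:(n-1)    # 1 2 3 4

# A leading minus sign binds more tightly than ":".
negative.start <- -1:2 # -1 0 1 2
```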

To generate sequences of real numbers you can use the function seq(). The instruction
> seq(-4,1,0.5)
 [1] -4.0 -3.5 -3.0 -2.5 -2.0 -1.5 -1.0 -0.5  0.0  0.5  1.0
generates a sequence of real numbers between -4 and 1 in increments of 0.5. Here are a few other examples of the use of the function seq() ^{14} ,
> seq(from=1,to=5,length=4)
[1] 1.000000 2.333333 3.666667 5.000000
> seq(from=1,to=5,length=2)
[1] 1 5
> seq(length=10,from=-2,by=.2)
 [1] -2.0 -1.8 -1.6 -1.4 -1.2 -1.0 -0.8 -0.6 -0.4 -0.2
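The by and length arguments of seq() are two ways of describing the same sequence. As a small worked check, with object names chosen for this illustration, asking for n values between two end points is equivalent to using a step of (to - from)/(n - 1):

```r
first <- 1
last  <- 5
n     <- 4

step <- (last - first) / (n - 1)                      # 4/3

by.length <- seq(from = first, to = last, length = n)
by.step   <- seq(from = first, by = step, length = n)

all.equal(by.length, by.step)                         # the two sequences agree
```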

Another very useful function to generate sequences is the function rep(),
> rep(5,10)
 [1] 5 5 5 5 5 5 5 5 5 5
> rep('hi',3)
[1] "hi" "hi" "hi"
> rep(1:3,2)
[1] 1 2 3 1 2 3

The function gl() can be used to generate sequences involving factors. The syntax of this function is gl(k,n), where k is the number of levels of the factor, and n the number of repetitions of each level. Here are two examples, 
^{1}^{4} You may want to have a look at the help page of the function (typing for instance ’?seq’), to better understand its arguments and variants.
> gl(3,5)
 [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
Levels: 1 2 3
> gl(2,5,labels=c('female','male'))
 [1] female female female female female male   male   male   male   male
Levels: female male
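The following sketch checks the result of gl() with levels() and table(). It also uses the optional third argument of gl(), the total length of the result, which the text does not cover; the object names are hypothetical.

```r
# Two levels, five repetitions of each
sex <- gl(2, 5, labels = c("female", "male"))
levels(sex)   # "female" "male"
table(sex)    # five observations per level

# Third argument: total length; the pattern is recycled until it is reached,
# so one repetition per level gives alternating values
sex_alt <- gl(2, 1, 6, labels = c("female", "male"))
sex_alt
```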
Finally, R has several functions that can be used to generate random sequences according to a large set of probability density functions. The functions have the generic structure rfunc(n, par1, par2, ...), where func is the name of the density function, n is the number of data to generate, and par1, par2, ... are the values of some parameters of the density function that may be necessary. For instance, if you want 10 randomly generated numbers from a normal distribution with zero mean and unit standard deviation, type
> rnorm(10)
 [1] 0.306202028 0.335295844 1.199523068 2.034668704 0.273439339
 [6] 0.001529852 1.351941008 1.643033230 0.927847816 0.163297158
while if you prefer a mean of 10 and a standard deviation of 3, you should use
> rnorm(10,mean=10,sd=3)
 [1]  7.491544 12.360160 12.879259  5.307659 11.103252 18.431678  9.554603
 [8]  9.590276  7.133595  5.498858
To get 5 numbers drawn randomly from a Student t distribution with 10 degrees of freedom, type
> rt(5,df=10)
[1] 0.46608438 0.44270650 0.03921861 0.18618004 2.23085412
R has many more probability functions, as well as other functions for obtaining the probability densities, the cumulative probabilities and the quantiles of these distributions.
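As an illustration of these companion functions, the sketch below uses the base R functions dnorm(), pnorm(), qnorm() and qt(), which follow the same naming pattern as rnorm() and rt(). The seed value is arbitrary and is set only so that the random draws can be reproduced; the object name draws is hypothetical.

```r
set.seed(1234)                         # arbitrary seed, for reproducibility
draws <- rnorm(10, mean = 10, sd = 3)  # 10 random normal values

dnorm(0)             # density of the standard normal at 0
pnorm(1.96)          # cumulative probability P(X <= 1.96)
qnorm(0.975)         # quantile with 97.5% of the probability below it
qt(0.975, df = 10)   # same quantile for a Student t with 10 degrees of freedom
```

Running the block again with the same seed gives the same values in draws.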
1.2.7 Indexing
We have already seen examples of how to get one element of a vector by indicating its position between square brackets. R also allows you to use vectors within the brackets. There are several types of index vectors. Logical index vectors extract the elements corresponding to true values. Let us see a concrete example.
> x <- c(0,-3,4,-1,45,90,-5)
> x
[1]  0 -3  4 -1 45 90 -5
> x > 0
[1] FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE
> y <- x>0
The third instruction of the code shown above is a logical condition. As x is a vector, the comparison is carried out for all elements of the vector (remember the famous recycling rule!), thus producing a vector with as many logical values as there are elements in x. We then store this logical vector in object y. You can now obtain the positive elements in x using this vector y as a logical index vector,
> x[y]
[1]  4 45 90
As the true elements of vector y are in the 3rd, 5th and 6th positions, this corresponds to extracting these elements from x. Incidentally, you could achieve the same result by just issuing,
> x[x>0]
[1]  4 45 90
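The link between a logical vector and the positions it selects can be inspected with the base R functions which() and sum(), which are not part of the text above. In this sketch the vector x is defined again so that the block is self-contained, and the name pos is hypothetical.

```r
x <- c(0, -3, 4, -1, 45, 90, -5)
pos <- x > 0   # logical vector, one value per element of x

which(pos)     # positions of the TRUE values: 3 5 6
sum(pos)       # TRUE counts as 1, so this is the number of positive elements: 3
x[pos]         # the positive elements themselves: 4 45 90
```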
Taking advantage of the logical operators available in R you may use more complex logical index vectors, as for instance,
> x[x <= -2 | x > 5]
[1] -3 45 90 -5
> x[x > 40 & x < 100]
[1] 45 90
As you may have guessed, the “|” operator performs logical disjunction, while the “&” operator is used for logical conjunction. This means that the first instruction shows us the elements of x that are either less than or equal to -2, or greater than 5. The second example presents the elements of x that are both greater than 40 and less than 100. R also allows you to use a vector of integers to extract elements from a vector. The numbers in this vector indicate the positions in the original vector to be extracted,
> x[c(4,6)]
[1] -1 90
> x[1:3]
[1]  0 -3  4
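A few further properties of integer index vectors can be sketched as follows: positions may appear in any order, may be repeated, and a position beyond the end of the vector yields NA. The vector x is defined again so that the block is self-contained.

```r
x <- c(0, -3, 4, -1, 45, 90, -5)

x[c(6, 4)]      # order of the index is respected: 90 -1
x[c(1, 1, 2)]   # positions can be repeated: 0 0 -3
x[length(x)]    # last element: -5
x[10]           # position beyond the end of x: NA
```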

Alternatively, you may use a vector with negative indexes to indicate which elements are to be excluded from the selection,
> x[-1]
[1] -3  4 -1 45 90 -5
> x[-c(4,6)]
[1]  0 -3  4 45 -5
> x[-(1:3)]
[1] -1 45 90 -5

Note the need for parentheses in the last example due to the precedence of the “:” operator.
Indexes may also be formed by a vector of character strings, taking advantage of the fact that R allows you to name the elements of a vector through the function names(). Named elements are sometimes preferable because names are easier to memorize than positions. For instance, imagine you have a vector of measurements of a chemical parameter obtained at 5 different places. You could create a named vector as follows,