
Analysis of

Ecological
Data with R

Georg Hörmann
Institute for Natural Resource
Conservation
ghoermann@hydrology.uni-kiel.de

Ingmar Unkel
Institute for Ecosystem Research
iunkel@ecology.uni-kiel.de

Christian-Albrechts-Universität zu
Kiel



Changelog: (what changed when)

Fall 2001: first version


March 2002: revision of the structure, minor adjustments, approx. 300 downloads per
month
September 2002: revision, minor adjustments, conversion to OpenOffice using Debian
January 2003: revision, new version of phpMyAdmin, pivot table written out
October 2003: revision for the winter semester course
October 2004: revision for the course, now four hours long, new orthography (partially)
June 2011: Translation by Kevin Callon
October 2011: complete rewrite
October 2011: revision of the revision. Ingmar joined the team and tried to topple
everything over
October 2012: cluster analysis and ordination added
October 2013: minor corrections, added simple ggplot2 examples
Spring 2014-Autumn 2014: removed all references to spreadsheets, replaced by R
operators & functions
Autumn 2014: started to replace data management functions with dplyr
Autumn 2015: adapting everything to the Hadley Wickham-Universe (dplyr, tidyr,
reshape2, ggplot2, readxl)

Author's Copyright

This book, or whatever one chooses to call it, is subject to the GNU license (GPL, full details
available on every good search engine). It may be further distributed as long as no money is
requested or charged for it.
Table of Contents
1 Introduction............................................................................................7
1.1 Excursus: Freedom for Software: The Linux or Microsoft Question..................................8

2 Basics with R..........................................................................................10


2.1 Installation.................................................................................................................................12
2.1.1 Base System.......................................................................................................................12
2.1.2 User Interfaces..................................................................................................................12
2.1.2.1 Rcmdr.........................................................................................................................12
2.1.2.2 Rstudio.......................................................................................................................15
2.2 The Hamburg climate data set...............................................................................................16
2.3 Import of data...........................................................................................................................18
2.3.1 Recommended input format..........................................................................................18
2.3.2 Import of text data (csv, ASCII) with Rcmdr................................................................19
2.3.3 Direct import of Excel Files............................................................................................20
2.3.4 Import with Rstudio.........................................................................................................22
2.3.5 First data checks, basic commands...............................................................................22
2.4 Working with variables...........................................................................................................24
2.4.1 Variable types and conversions.....................................................................................24
2.4.2 Accessing variables..........................................................................................................24
2.4.3 Coding date and time.......................................................................................................26
2.5 Simple figures...........................................................................................................................26
2.6 Tasks...........................................................................................................................................27
2.7 Debugging R scripts..................................................................................................................28

3 Data management with dplyr and reshape2..........................................29


3.1 Data Organization.....................................................................................................................29
3.1.1 Optimum structure of data bases..................................................................................29
3.1.2 Dealing with Missing Values...........................................................................................30
3.2 Basic use of dplyr......................................................................................................................30
3.3 Reshaping data sets between narrow and wide..................................................................32
3.4 Merging data bases...................................................................................................................32

4 Exploratory Data analysis......................................................................34


4.1 Simple numeric analyses.........................................................................................................34
4.2 Simple graphic methods..........................................................................................................35
4.3 Line and scatter plots...............................................................................................................35
4.4 Combined figures......................................................................................................................37
4.4.1 Figures in regular matrix with mfrow().......................................................................37
4.4.2 Nested figures with split.screen()..................................................................................37
4.4.3 Free definition of figure position...................................................................................38
4.4.4 Presenting simulation results........................................................................................40
4.4.5 Combined figures with the lattice package..................................................................41
4.5 Brushing up your plots made with the standard system...................................................43
4.5.1 Setting margins.................................................................................................................43
4.5.2 Text for title and axes......................................................................................................44
4.5.3 Colors..................................................................................................................................44
4.5.4 Legend................................................................................................................................44
4.5.5 More than two axes..........................................................................................................45
4.6 Plots with ggplot2.....................................................................................................................45
4.6.1 Simple plots.......................................................................................................................45
4.6.2 Multiple plots with ggplot2, gridExtra version...........................................................46
4.6.3 Multiple plots with ggplot2, viewport version............................................................47
4.7 Saving Figures...........................................................................................................................48
4.8 Scatterplot matrix plots..........................................................................................................48
4.9 3d Images...................................................................................................................................50

5 Bivariate Statistics.................................................................................52
5.1 Pearson's Correlation Coefficient.........................................................................53
5.2 Correlograms - correlation matrices...................................................................58
5.3 Classical Linear Regression.....................................................................................................60
5.3.1 Analyzing the Residuals..................................................................................................61

6 Univariate Statistics..............................................................................66
6.1 Student's t Test.........................................................................................................................66
6.2 Welch's t Test...........................................................................................................68
6.3 F-Test..........................................................................................................................................69
6.4 χ²-Test - Goodness of fit test..............................................................................69

7 Multiple and curvilinear regression......................................................72


7.1 Multiple linear regression.......................................................................................................72
7.2 Curvilinear regression.............................................................................................................72

8 Cluster Analysis.....................................................................................79
8.1 Measures of distance................................................................................................................79
8.2 Agglomerative hierarchical clustering.................................................................................82
8.2.1 Linkage methods..............................................................................................................82
8.2.2 Clustering Algorithm.......................................................................................................83
8.2.3 Clustering in R..................................................................................................................84
8.3 Kmeans clustering....................................................................................................................85
8.4 Chapter exercises.....................................................................................................................87
8.5 Problems of cluster analysis...................................................................................................88
8.6 R code library for cluster analysis.........................................................................................89

9 Ordination.............................................................................................90
9.1 Principal Component Analysis (PCA)....................................................................90
9.1.1 The principle of PCA explained......................................................................................90
9.1.2 PCA in R..............................................................................................................................93
9.1.2.1 Selecting the number of components to extract................................................94
9.1.3 PCA exercises....................................................................................................................94
9.1.4 Problems of PCA and possible alternatives..................................................................96
9.2 Multidimensional scaling (MDS)............................................................................................96
9.2.1 Principle of a NMDS algorithm.......................................................................................96
9.2.2 NMDS in R..........................................................................................................................97
9.2.3 NMDS Exercises................................................................................................................99
9.2.4 Considerations and problems of NMDS........................................................................99
9.3 R code library for ordination................................................................................................101

10 Spatial Data........................................................................................102
10.1 First example.........................................................................................................................102
10.2 Point Data..............................................................................................................................103
10.2.1 Bubble plots...................................................................................................................103
10.3 Raster data.............................................................................................................................104
10.4 Vector Data............................................................................................................................105
10.5 Working with your own maps............................................................................................107

11 Time Series Analysis..........................................................................109


11.1 Definitions.............................................................................................................................109
11.2 Data sets.................................................................................................................................110
11.3 Data management of TS.......................................................................................................110
11.3.1 Conversion of variables to TS.....................................................................................110
11.3.2 Creating factors from time series..............................................................................111
11.4 Statistical Analysis of TS.....................................................................................................112
11.4.1 Statistical definition of TS...........................................................................................112
11.4.2 Trend Analysis..............................................................................................................113
11.4.2.1 Regression Trends................................................................................................113
11.4.2.2 Filter.......................................................................................................................113
11.5 Removing seasonal influences...........................................................................................113
11.6 Irregular time series............................................................................................................114
11.7 TS in R.....................................................................................................................................114
11.7.1 Auto- and Crosscorrelation........................................................................................115
11.7.2 Fourier- or spectral analysis.......................................................................................115
11.8 Sample data set for TS analysis..........................................................................................117

12 Practical Exercises.............................................................................120
12.1 Tasks.......................................................................................................................................120
12.1.1 Pivot Tables...................................................................................................................120
12.1.2 Regression Line.............................................................................................................121
12.1.3 Database Functions......................................................................................................121
12.1.4 Frequency Analyses.....................................................................................................121

13 Applied Analysis.................................................................................122
14 Solutions............................................................................................124
Illustration Index
Figure 1: Workflow of an analysis.....................................................................................................11
Figure 2: Installation of Rcmdr..........................................................................................................13
Figure 3: Interface of Rcmdr..............................................................................................................13
Figure 4: After a successful import of the Climate data base........................................................14
Figure 5: File menu of Rmcdr, used to save command and data..................................................15
Figure 6: Rstudio user interface.........................................................................................................16
Figure 7: Source of the climate data set for Hamburg-Fuhlsbüttel.............................17
Figure 8: Contents of the climate archive data...............................................................................17
Figure 9: Common problems in spreadsheet files..........................................................................19
Figure 10: Structure of our climate data base (Hamburg).............................................................19
Figure 11: Import of data....................................................................................................................20
Figure 12: Settings for an import of the climate data set from the clipboard...........................20
Figure 13: Result of a data import.....................................................................................................21
Figure 14: Data import with RStudio................................................................................................22
Figure 15: Control of variable type...................................................................................................23
Figure 16: Frequent problem with a conversion of mixed variables...........................................24
Figure 17: Example of a good and bad database structure for daily time series.......................30
Figure 18: Example of a good and bad database structure for lab data......................................30
Figure 19: Structure of the "molten" data set.................................................................................32
Figure 20: Combining figures with the split.screen() command..................................................38
Figure 21: Layout frame......................................................................................................................39
Figure 22: Result of the layout commands.......................................................................................39
Figure 23: Common display of hydrological simulation results...................................................41
Figure 24: Scatterplot of air temperatures with annual grouping...............................................43
Figure 25: a Probability density function f(x) and b cumulative distribution function F(x) of
a χ² distribution with different values for the degrees of freedom.................................65
Figure 26: a Probability density function f(x) and b cumulative distribution function F(x) of
a Student's t distribution with different values for the degrees of freedom...................66
Figure 27: Illustration of Jaccard distance.......................................................................................80
Figure 28 : Illustration of Euclidean and Manhattan distance......................................................81
Figure 29: Illustration of different ways to determine the distance between two clusters. For
example by single-linkage (A) or complete-linkage (B).................................................83
Figure 30: Illustration of the agglomerative hierarchical clustering algorithm.......................84
Figure 31: Illustration of the kmeans clustering algorithm..........................................................86
Figure 32: Illustration of the PCA principles...................................................................................91
Figure 33: Graphic result of a PCA.....................................................................................................92
Figure 34: Leptograpsus variegatus..................................................................................................94
Figure 35: Summary of one variable (Mean Temperature).........................................................123
1 Introduction
There are many books on statistics, difficult to digest and with a tendency to reprint
formulas. There are still more books for every possible type of software, in which
formatting and graphic creation is placidly explained. What's lacking is a compilation of the
methods and tools used daily in practice.
This book is not meant to replace statistical textbooks and programming handbooks, but is
rather meant as a summary for ecologists containing numerous practical tips which
otherwise would have to be gathered from many different sources. It was conceived as an
accompaniment to a course at the Ecology Center of the University of Kiel, in which
students of geography, biology, and agricultural science are introduced to analyzing data
records.
The students have usually had an introduction to statistics and a basic course in data
processing. The scope of both is generally limited, and the connection between the two has
rarely been made, even though this knowledge is fundamental and, at the latest by the time
of the diploma thesis, a prerequisite.
The aim of this book as well as that of our course is to give students an overview of the
methods and tools used to analyze data records based on measurements and modeling. The
structure of this book is built on the work flow used in the analysis of data.
In the review of tools we've made a point of emphasizing open-source software. This is
partly for financial reasons: small institutions and engineering firms often cannot even
afford large and expensive packages, the range of functions of which moreover are often
oriented more toward the needs of bookkeepers than those of scientists. Software from the
realm of natural sciences may often be arduous to learn but is in return more flexible and
more productive in the long run.
The data sets for this course are available on a website in the internal e-learning system of
Kiel University (OLAT, https://www.uni-kiel.de/lms/dmz/) where example data and files as
well as the latest version of this book are available for download. Current links to the
recommended software can of course also be found there.
Presuppositions: this book doesn't provide an introduction to the various programs;
rather, it presupposes basic knowledge of user software and operating systems. We cover
the things a user needs in practical situations but which are rarely mentioned in the
respective introductory courses.
The authors of this book have seen everything that can go right or wrong. They also take
the point of view that irony, amusement, and the regular enjoyment of Monty Python films
are fundamental requirements for survival when doing scientific work.

Comments on typography
Warnings are displayed like this. They point out everyday (oftentimes
banal) mistakes that can set off an hour-long search for the
cause
1: Exercises and homework for courses are marked like this

Further information, literature and Internet addresses

Formulas for R and Excel worksheets look like this

1.1 Excursus: Freedom for Software: The Linux or Microsoft Question
Word may have gotten around by now that we don't live in the best of all worlds; for one
thing PCs and software would be freely available if we did. On the one hand we as users
want to pay as little as possible, but at the same time the programmers of our software can't
live on air and appreciation alone, at least not for long. The completely normal capitalistic
model has software sold as any other merchandise and the people who design and build it
paid as any other worker - that's the Microsoft version.
The other side views things somewhat more idealistically: software is a human right and
should flow freely in the free stream of ideas. Users and programmers constitute an organic
unit and continually develop the product together. The programmer earns his/her money not
(only) through the software but through the related rendering of services. There are then
also people who develop software out of idealism and for whom an elegant piece of software
affords the same pleasure as a good concert - that's the Linux version. It's significantly
more prevalent in the academic world because many programs developed with
governmental financial support are passed around free of charge.
Why Linux ?
Linux is available freely or inexpensively even for commercial application.
Linux is a modular operating system, so unused functions take up no storage space
and can't crash. It's thus possible, for example, when using systems for data logging
or when using a pure database system, to avoid graphical user interfaces altogether.
Linux systems are also serviceable remotely through slow connections - no ifs or
buts. In case of an emergency, just about the whole system can be reinstalled online.
Linux systems run more stably and are less demanding on hardware.
Linux systems are fully documented - all interfaces etc.

Along with the technical arguments there's also the current financial situation of schools
and learning institutions of various levels, and also of many smaller firms. When the
operating system and the office suite together are more expensive than the computer on
which they're installed, many consider whether they shouldn't just buy two PCs with Linux
and LibreOffice. Then come the exorbitant prices for software in the technical/academic
world. When simulation software, geographical information systems (GIS), statistics
packages, and databases are all needed, then the price of the hardware becomes negligible.
Worse still, the continual further development of office suites and the expensive updates for
technical/scientific versions have contributed practically nothing of real value.
We have therefore decided in favor of a dual track: we discuss solutions to problems with
standard packages that are also applicable to open-source software (Excel, LibreOffice) and,
concerning more expensive special software for statistics, graphics, and data processing,
elaborate more upon free software.

http://iso.linuxquestions.org/ contains ISO images of all important Linux


distributions. The data can be downloaded, burned to CD-ROM and
used as an installation medium.
http://www.ubuntu.com/ is the Linux we prefer - it has very good
compatibility and is very user friendly. You can install it within a
Windows system or run it from a CD to test it.
2 Basics with R
R is fully documented; there are many tutorials for beginners but also quite advanced
manuals for special problems, e.g. in regression or time series analysis. This is why we limit
this introduction to a basic practical work session where you can see how things are
handled in R.
In our course, this session is the first session with R. In practice, you may have already
checked your data set with a spreadsheet and are now ready to start with the real work. It is
also quite common to check things out with a spreadsheet and then transfer the whole
process to R where it can be automated for future use. This is why we repeat things we
already did with a spreadsheet, but we are sure that you will soon prefer the minimalistic
beauty of an R command line over the silly and redundant mouse clicking of the
uninitiated population.
The structure of this chapter strictly follows the workflow of a typical session in R (see Fig.
1). We will explain things when they first appear in the workflow, even if this sometimes
breaks the logic of the program and/or the interface.
If you google for solutions to problems discussed in this book you might get very confused
because the internet solution is completely different from our procedure. This is one of the
disadvantages of open source software: if someone gets upset enough about something, he
can always publish a better solution. This has happened quite frequently with R, especially
for graphics and data management. There are at least three different graphics subsystems,
all with a different philosophy, grammar and look. Sometimes not even x and y have the
same position in the different procedures. For this book we try to use the modern ggplot2
library and go back to the older ones if needed.
In data management, the situation was even worse. Only in the last year has something like
a common denominator emerged with the dplyr library. Unfortunately, this library is hard
to master for a beginner and changes quite fast. We try to use it as often as possible because
it offers a very consistent interface to data management and is similar to other database
languages like SQL.
Figure 1: Workflow of an analysis - prepare data (structure, missing values, reformat);
import data; control and clean the import (data types, structure, missing values, extreme
values, date/time); summarize the data with statistics (annual summaries, factor summaries,
frequencies) and with plots (time series, xy/scatter plots, boxplots); advanced statistics.
2.1 Installation

2.1.1 Base System


In Windows you have to download the program from www.r-project.org and install it.
In Linux (Ubuntu and other Debian systems) you can install R together with a user interface
from the Software center or directly from the command line with
sudo apt-get install r-cran-rcmdr
The version from r-project.org is usually newer, but the version in the repositories is
better supported.

2.1.2 User Interfaces


During the rest of this course we will use the basic R installation with Rcmdr and Rstudio
as additional user interfaces. Each program has its fans; our personal experiences are:
Rcmdr is suitable for absolute beginners because you can use the menus to carry out
the first steps.
Rstudio is the software we use for daily work. It is a modern interface/editor for a
programming language, but you have to type everything manually. You can inspect
plots and the contents of variables directly.

2.1.2.1 Rcmdr
To install the Rcmdr interface, select Packages -> Install Packages from the R GUI.

If you have never worked with packages before, R will ask you which mirror server it should
use - select one close to you or in the same network (e.g. Göttingen for German
universities). Next, select all packages of Rcmdr as shown in Fig. 2 and wait for the
installation to finish. After the installation you should start the interface with
library(Rcmdr)

Before it starts for the first time it will load additional packages from the internet. After this
process, Rcmdr will come up (Fig. 3) and is ready for work.
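If you prefer the console over the menus, the installation can also be started with a single command; a minimal sketch (the mirror question will appear in the same way):
install.packages("Rcmdr", dependencies = TRUE) # install Rcmdr and the packages it needs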
For the first steps in R we recommend Rcmdr, because it helps you to import data files and
builds commands for you. If you are more familiar with R you can switch to Rstudio, which
is the more modern GUI.
We will use the successful import of our data into R (Fig. 4) as an introduction to the basic
philosophy of Rcmdr. The program window consists of three parts: the script window, the
output window and the message window.
The script window contains the commands sent by Rcmdr to R. This is the easiest way
to study how R works, because Rcmdr translates all commands you select with the
mouse in the interface to proper R code. You can also type in any command
manually. To submit a command or a marked block you have to click on the submit
button.
The output window shows the results of the operation you just submitted. If you type
in the command 3+4 in the script window and submit the line, R confirms the
command line and prints out the result (7).
The message window shows you useful information, e.g. the size of the database we
just imported.

Much of the power of R comes from a clever combination of the script window and the file
menu shown in Fig. 5.

The commands dealing with the script file save or load the commands contained in a
simple text file. This means that all commands you or Rcmdr issue in one session can be
saved in a script file for later use. If you e.g. put together a really complex figure you can
save the script and repeat it whenever you need it with a different data set.
The same procedure can be used for the internal memory, called workspace in R. It
contains all variables in memory; you can save it before you quit your session and R reloads
it automatically next time. If you want to reload it manually you can use the original R
interface or load it from the data menu.
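The same steps are available as plain commands; a minimal sketch (the file name is only an example):
save.image("climate_session.RData") # save all variables currently in memory
load("climate_session.RData")       # restore them in a later session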

2.1.2.2 Rstudio
The Rstudio GUI (http://www.rstudio.com) has to be installed like any other Windows
program (Fig. 6).
For Ubuntu Linux you also have to download and install the software from the website; it is
not part of the software repositories.
2.2 The Hamburg climate data set
The data set we use in our course is a climate data set from the Hamburg station, starting in
the year 1891. The data set is part of the freely available climate data sets and you can
download the latest version from the (horribly structured) website of the German Weather
Service at http://www.dwd.de/ (see Fig. 7 for a screenshot of the download location). The
contents of the download are shown in Fig. 8. You need only the file marked with the red
circle; all other files are documentation (in German). Rename the data file to something
reasonable like climate.txt (see Fig. 10). Do not worry about the German descriptions of
the variables, we will change them immediately after the import.
2.3 Import of data
For a good start of this lesson we need the data base and an interface to the R program.
The structure of the data base is shown in Fig. 10, but you can use any climate data set.

Column (variable) names in spreadsheets should not contain


spaces or special characters (Umlaut etc.).
Do not mix text and numbers (e.g. to mark missing values). Use
-99 or a similar code for missing values in numeric columns

2.3.1 Recommended input format


Before you think about importing data from worksheets you should check the structure of
the data base in the worksheet.

Make sure that


the data base has a proper, rectangular structure
Delete all mixed data columns, use numeric codes for missing
values (-999), avoid text like NA. Empty cells are ok.
Check variable/column names (no special characters, no
spaces and/or operator symbols (+, -))

Fig. 9 shows some common problems in spreadsheet files which cause trouble later on in
R. First, check that there is only one rectangular matrix per worksheet. Remove all old
intermediate steps like the ones shown in columns E-G in Fig. 9. Check also the lower end of
the spreadsheet for sums and other grouping lines. Second, check the variable/column
names. They should only contain good old-fashioned ASCII characters - no spaces (Fig. 9),
umlauts, operator symbols or other characters which have a special meaning (/, (, )).
Third, check the columns for text, especially text used to mark missing values ("-" etc.).
2.3.2 Import of text data (csv, ASCII) with Rcmdr
The first step of an analysis is to import the data. Usually, the data set is already available as
spreadsheet or text file and you need only the commands shown in Fig. 11.
The data set must have a rectangular form without empty rows or
columns

If you import the data set from the clipboard you should take care to fill out properly the
fields marked in red in Fig. 12, especially the decimal-point character and the field
separator.

2.3.3 Direct import of Excel Files


There are several so called libraries to import Worksheets directly, but many of the have
their problems and work only with the outdated 32bit version. We had the best experiences
with the gdata library. Unfortunately you have to install another software/programming
language to get it working, the Perl environment. On Windows machines you can download
and install Active Perl from www.activestate.com/.
First, you have to load the package used for import of spreadsheets.
library(gdata) # xls import
If you get an error message about a missing Perl interpreter you did not install the
required software.
The proper import process is quite simple. To read the first sheet from the worksheet you
type:
climate = read.xls("climate_import.xlsx", sheet=1)
The next step is quite essential: check the structure of the imported file with
str(climate)
Fig. 13 shows the correct output after a successful import. First, you should check the
number of variables and rows (observations). Second, check the data type of the variables.
In our case, they should be numeric, i.e. numeric or integer. Other types like character or
factor occur if there is text in at least one row of the file. In this case you should go back to
chapter 2.3.1. Do not worry about the factor type for columns with date and time, we will
deal with this problem in the next chapter.
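An alternative that does not need a Perl installation is the readxl package mentioned in the changelog; a sketch, assuming the same file name as above:
library(readxl)
climate = read_excel("climate_import.xlsx", sheet = 1) # reads the first sheet
str(climate)                                           # check the structure as usual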
2.3.4 Import with Rstudio

Import of data in CSV format is also available in Rstudio. Figure 14 shows the import
function. The available options are the same as in Rcmdr.

2.3.5 First data checks, basic commands


After the import of the data set you should first check some basic things: the variable types
(structure) with
str(Climate)
The results should look like Fig. 15 if you used the import function of Rcmdr. The whole file
is converted to a variable of type data.frame which consists of several sub-variables
corresponding to the columns of the file. The data type of all variables is numeric - this is
the normal case.
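A quick way to spot columns that slipped through as text is to look at the class of every column; a minimal sketch, assuming the data frame is called Climate as in the rest of this chapter:
sapply(Climate, class) # columns reported as "factor" or "character" probably contain text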
If you need help you have different choices. If you know the name of a command, you get
help with
help(ls)
You can search in the help files with
??plot
Other useful commands:
ls()
lists all variables currently in memory.
rm()
removes a variable.
edit(Climate) or View(Climate)
lets you control your data set and change or view values.
names(Climate)
Lists the names of the sub-variables (columns) contained in a variable
Climate
If you type the name of a variable, the contents are displayed

Fig. 16 shows a frequent problem: if the imported file contains not only numbers but also
text (see left side, text "Missing"), then the whole column is converted to a factor variable,
i.e. the variable cannot be used for computation, only for classification.
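If this has already happened, the column can usually be repaired after the import; a minimal sketch, assuming the affected column is AirTemp_Max:
# convert the factor to text and then to numbers;
# the text entries (e.g. "Missing") become NA with a warning
Climate$AirTemp_Max = as.numeric(as.character(Climate$AirTemp_Max))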
2.4 Working with variables

2.4.1 Variable types and conversions


In spreadsheets you can mix variable types as you like, e.g. text and numbers. For statistical
analysis this does not make sense: you cannot calculate a mean value between numbers
and text. This is why statistical programs (and data bases) are more strict when it comes to
the types of variables. The following types of variables are quite common in programming
languages and R:
real numbers (1.33)
integer numbers (1, 2, 3, 4, ...)
boolean (yes/no, 0/1)
text ("This is a text")
character ("A", "b", but also "1", "2")
Another important type in R is the factor variable. You can think of it as a header in a pivot
table; it is used for the classification of values. A factor variable can be a text like "Forest" or
a number (e.g. a year).
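A minimal sketch of how a factor behaves (the land-use values are made up for illustration):
landuse = factor(c("Forest", "Meadow", "Forest", "Arable"))
levels(landuse) # "Arable" "Forest" "Meadow"
table(landuse)  # counts per class - the same job a pivot table header does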

2.4.2 Accessing variables


Frequently we do not use the whole data set but only parts of it. The following examples
show how to use parts of the climate data set from Fig. 15. The commands look simple at
first sight, but they can replace a data base and even advanced filters in Excel.
Climate$AirTemp_Max
Displays the content of the AirTemp_Max column.
The output of the next commands is quite obvious: you select rows and columns by numbers.
Climate[1,1]
First value of first line
Climate[1,"Station"]
Same as before, but with variable name instead on number
Climate[1,]
All values of first record
Climate[,1]
All values of first variable

The next commands are hard to understand at first sight, but they are one source of the
unmatched elegance and flexibility of R.
Climate[-1,]
All values except the first line
Climate[1:10,]
The first ten lines
Climate[1:10, c(2:4,7,9)]
The first ten lines of columns 2-4, 7 and 9. The expression c() creates a vector - most
commands accept it as input.
Climate[Climate$AirTemp_Max>35,]
Get only the records with AirTemp_Max > 35
Climate$AirTemp_Max[(Climate$AirTemp_Max>19 &
Climate$AirTemp_Max<20)]
Get the values of AirTemp_Max between 19 and 20. The logical OR condition is handled by
the operator |.
If you want to keep the result and save it in a variable, you can use the = operator.
climax = Climate[(Climate$AirTemp_Max>19 &
Climate$AirTemp_Max<20),]
Creates a new variable climax with the contents of the selection.
If you do not want to type the name of the data matrix each time you need a variable, you
can use
attach(Climate)
to make variables inside a data matrix visible. After the attach command, the command
AirTemp_Max[(AirTemp_Max>19 & AirTemp_Max<20)]
will list the same result as the command above with full names. Another (politically correct)
method to access variables inside a data frame is to use the with function:
with(Climate, AirTemp_Max[(AirTemp_Max>19 & AirTemp_Max<20)])
2.4.3 Coding date and time
The format of the date in Germany is usually in the form DD.MM.YYYY, in international publications it
is in ANSI form written as YYYY.MM.DD. These text-formatted dates are usually converted into
numbers. Normally, the day-count is the integer part of the coded number, the decimal fraction
represents the time of the day as the fractional day since midnight. What makes handling of date values
difficult is that different programs use different base values for the day-count. ANSI e.g. uses 1601-01-01
as day no. 1 while some spreadsheets use 1900-01-01 on PC and 1904-01-01 on Mac computers. It is
therefore highly recommended to use the text format for data exchange. The commands are explained
in chapter 11.3.1 on page 110.
For our climate data set we need real dates, so we have to convert the input to internal date
values.
Climate$Meas_Date=as.character(Climate$Meas_Date)
A conversion of the integer variable to text makes it easier to create the date.
Climate$Date= as.Date(Climate$Meas_Date, "%Y%m%d")
Convert the text to a real date. See the help for a complete list of all format options.
Climate$Year = format(Climate$Date, "%Y")
Extract years from the date - we need this information later for annual values.
# check data type!
Climate$Year = as.numeric(Climate$Year)
The format function returns text, we convert it back to a number.
Climate$Month = format(Climate$Date, "%m")
Climate$Month = as.numeric(Climate$Month)
Climate$Dayno = Climate$Date - as.Date(paste(Climate$Year,"/1/1"), "%Y /%m/%d") + 1
In ecology we frequently need the day number from 1 to 365. We get it in R by subtracting
the date of the 1st of January from the current date.
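A shorter alternative uses the format code "%j" (day of the year); a sketch that should give the same values, stored here in a new column Dayno2 so that the original is kept:
Climate$Dayno2 = as.numeric(format(Climate$Date, "%j")) # format() returns text, so convert it back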

2.5 Simple figures
R has three different graphic subsystems. For a first overview we recommend the new
ggplot2 library.
library(ggplot2)

qplot(Date,AirTemp_Mean,data=Climate)
If you do not specify the type of figure you want, ggplot2 makes a guess.
qplot(Date,AirTemp_Mean,data=Climate,geom="line")
The geom parameter defines the type of figure you want to have. In this case "line" is a good
choice.
qplot(Year,AirTemp_Mean,data=Climate,geom="boxplot")
qplot(as.factor(Year),AirTemp_Mean,data=Climate,geom="boxplot")
Some commands cannot handle all data types; here we have to convert the numeric variable
Year to a factor variable.
qplot(as.factor(Month),AirTemp_Mean,data=Climate,geom="boxplot")
Boxplots are not always the best method to display data. If the distribution might be
clustered, the jitter type is a good alternative. It displays all points of a data set.
qplot(as.factor(Month),AirTemp_Mean,data=Climate,geom="jitter")
One of the advantages of the qplot command is that you can use colours and matrix plots
out of the box.
qplot(as.factor(Month),AirTemp_Mean,data=Climate,geom="jitter",col=Year)
qplot(as.factor(Year),AirTemp_Mean,data=Climate,geom="jitter",col=Month)
It can be quite useful to plot data sets in annual or monthly subplots; with the facets
option you can plot one- or two-dimensional matrices of plots.
qplot(as.factor(Year),AirTemp_Mean,data=Climate,geom="jitter",facets= Month ~ .)
qplot(as.factor(Year),AirTemp_Mean,data=Climate,geom="line",facets= Month ~ .)

qplot(as.factor(Dayno),AirTemp_Mean,data=Climate,geom="jitter",col=Month)
qplot(as.factor(Dayno),AirTemp_Mean,data=Climate,geom="jitter",col=Year)

2.6 Tasks
1: calculate a variable Climate$Summer where summer=1 and
winter = 0
2: Plot the summer and winter temperatures in a
boxplot

To convert one type to another, there are as.xxx() functions. To convert our


summer/winter classification to a factor variable we could type
Climate$Summer = as.factor(Climate$Summer)

3: create new factor variables for year and month. Do not


replace the original values, we will need them later.

To convert a factor variable back to its original value you need to
convert it first to a text variable:

Climate$Year = as.numeric(as.character(Climate$Year))
If you forget this not really obvious step, you get the index number of the factor level, not the value of the variable.

4: create an additional variable for groups of 50
years. Use the facets and col options to
check and display the temperatures.

2.7 Debugging R scripts


Many beginners are frustrated by R's error messages, but with some patience and trust in R
you will get over it. Especially when you start to use R, the first rule is:
The problem always sits in front of the screen.
Normally, R is only doing what you ask for, the problem is that your commands are broken.
The positive side is that you will learn a lot if you try to find errors on your own without
whining at your instructor that R is broken. Usually, R is not broken.
There are a few rules which will help you over the first problems:
1. Check spelling of variable names. R is case sensitive. Year is a different variable
than year.
2. Check and correct the first error messages
3. Check variable types. If a variable is of the factor type you cannot compute
anything with it and you cannot use a numeric variable to group boxplots.
4. Execute the code line by line and check the results
5. Google the error message - it often helps.
3 Data management with dplyr and reshape2
During the many years of R's development, different standards for data management have
emerged. There are a lot of different solutions for single problems - all with a different
syntax and philosophy. As always, when the confusion was at its highest, a redeemer
appeared and gave us dplyr, the one and only interface for data management in R. You can
replace nearly all features of dplyr with other (old) R functions, but dplyr offers a clean
and easy-to-understand interface to data.

3.1 Data Organization
Many data evaluations fail already in the preparatory stages - and with them, often, a
hopeful junior scientist. It's one of the most moving scenes in the life of a scientist when a
USB stick or notebook computer with freshly processed data (or data deleted in the logger)
sinks in a swamp, ditch, or lake.
Taking heed of our own painful experience, we've placed a chapter before the actual focal
point of the book in which data organization and data safeguarding are covered. Along
with it there's also a short overview of the vexing set of problems associated with various
date and time formats when working with time series.

3.1.1 Optimum structure of data bases


You can avoid many problems if you create a well-structured data base. This normally means:
one case per record. Everything else may consume less space, but you will have to
restructure the whole thing if you need a different analysis. Some common errors are
explained below. Figure 17 shows time series data. Sometimes the daily values are arranged
in the horizontal direction and the months in the vertical direction (left part). With this
structure you already run into problems if you want to create monthly values, because
different months have a different number of days. The right-side version looks more
complicated at first sight, but is in fact a more elegant and useful structure: you can create
monthly values with one single command. Figure 18 shows a similar data base with lab data:
the repetitions are located in the horizontal direction, the sample number in the vertical
direction. Again, putting only one sample in a record facilitates later analysis.
3.1.2 Dealing with Missing Values
A gap-free data record is as rare as a winning lottery ticket, and yet the majority of
practical applications and many computer models require one. The
methods used to touch up or fill in a record are complex and often apply only to a
specific field or even only to one single variable.

3.2 Basic use of dplyr
In short, dplyr offers you the basic functions of data management:
filter: select parts of the data set defined by filter conditions
select variables of a data set
arrange (sort) data sets
mutate: change values, calculate and create new variables
group: divide the data set in groups (e.g. years, months)
summarise: calculate mean, sums etc. for groups
combine/join two data bases
The use of the filter function is straightforward:
t5 = filter(Climate, AirTemp_Mean>5)
The select function also works as expected with names and numbers
t6 = select(t5, Date,Year,Month,Dayno,AirTemp_Mean)
You can also use the index of columns, but the version with names is more readable and
avoids problems if columns are deleted
t6 = select(t5, c(18:21,4))
You can use the arrange function to sort the data set by temperature and look for
signs of global change:
t7=arrange(t6,desc(AirTemp_Mean))
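arrange only sorts the data set; to actually pick out the 100 hottest days you can take the first 100 rows of the sorted set, e.g. (a minimal sketch, hot100 is just an example name):
hot100 = head(t7, 100)    # base R: first 100 rows of the sorted data set
hot100 = slice(t7, 1:100) # the same with newer dplyr versions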

5: Select the 100 hottest days and plot the temporal distribution as
a histogram in groups of 10 years
6: Select the 100 coldest days and plot the temporal distribution as
a histogram in groups of 10 years

You can easily use any R function to change values, but the politically correct way is to use
the mutate function. A common method in meteorology is to use the average of minimum
and maximum temperature as a replacement for the mean temperature.
t11 = mutate(Climate,New_Mean=(AirTemp_Max+AirTemp_Min)/2)
In pure R you get the same result with
t11$New_Mean = (Climate$AirTemp_Max + Climate$AirTemp_Min)/2

7: Draw a figure (scatterplot) with New_Mean and the measured


mean.

The most common application of dplyr is the calculation of means, sums etc., e.g. for
annual and monthly values. The first step is to create groups:
clim_group=group_by(Climate, Year)
With this grouped data set you can use any function to calculate new values based on the
grouping:
airtemp=summarise(clim_group,mean=mean(AirTemp_Mean),
median=median(AirTemp_Mean))
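The intermediate variables can be avoided with the pipe operator %>% that dplyr imports from the magrittr package; a minimal sketch of the same calculation in chained form:
airtemp = Climate %>%
  group_by(Year) %>%
  summarise(mean = mean(AirTemp_Mean), median = median(AirTemp_Mean))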

8: Calculate and draw monthly mean values


3.3 Reshaping data sets between narrow and wide
Much of the power of R in data management comes from a clever combination of dplyr,
reshape2 and ggplot2.
The reshape2 package is a utility to convert between the narrow and the wide format. Both
formats are used frequently. For all ggplot2 functions, the narrow format of a data set is the
better choice.
library(reshape2)
id=c("Year","Month","Dayno","Date")
measure=c("AirTemp_Mean","AirTemp_Max","AirTemp_Min","Prec","Hum_Rel")
Clim_melt=melt(Climate,id.vars=id,measure.vars=measure)
In figure 19 you can see the structure of the new data set. In the melt function, two
parameters are important: the id variables and the measure variables. The id variables
remain unchanged and are used as an index; date variables are typical id variables. The
columns of the measure variables are collapsed into the variable and value columns of
the new data set. One line of the original, wide data set is thus converted to 5 lines in the
narrow data set.

Now it's easy to create a quite complex figure with a simple command. Please note the +
as the last character of the first line; we need it to continue the graphics command.
qplot(Dayno,value,data=Clim_melt) +
facet_grid(variable ~ .,scales="free")
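The way back from the narrow to the wide format is handled by the dcast function of reshape2; a sketch that reverses the melt step above:
Clim_wide = dcast(Clim_melt, Year + Month + Dayno + Date ~ variable) # one row per date, one column per measured variable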

3.4 Merging data bases
To demonstrate how to merge two data bases we use a different data set. The
administration of the Plön district supervises a monitoring program of all lakes in the Plön
district (Edith Reck-Mieth). We have a data base of the annual measurements of chemical
properties and a data base of the static properties of the lakes like depth, area etc.
First, we read the two data bases and check content and structure.
chemie <- read.csv("chemie.csv", sep=";", dec=",")
stations <- read.csv("stations.csv", sep=";", dec=",")
In the second step we merge the two data bases. We can use the old style command
# join the two data bases
chem_all=merge(chemie,stations,by.x="Scode")
In dplyr syntax, the same result is produced by
chem_all2=inner_join(chemie,stations,by="Scode")

9: Check if there is a relation between lake depth, area and a


chemical variable of your choice.

Remember that you have a lot of choices to code values in


ggplot2
size
col
facets
linetypes/point types
4 Exploratory Data analysis
The expression exploratory data analysis (EDA) goes back to Tukey 1977. It is an approach to
analyzing data sets in order to summarize their main characteristics in an easy-to-understand
form, often with graphs. The main purpose is to get a feeling for the data set - similar to a
first exploratory walk in an unknown city.
We will also use this chapter to introduce the three different ways of creating figures in R. As
mentioned earlier, R has three different graphic subsystems: ggplot2, lattice and the
original base system - unfortunately they are not compatible, and unfortunately each system
has a few unique features of its own. The situation is similar with dplyr: sometimes the
older functions are easier to use and are more widespread.
To keep the data set manageable we can use the following lines to shorten it to the years
from 2000 onward and to create some factor variables.
clim2000 = Climate[Climate$Year >= 2000,]
str(clim2000)
# calculate/Update factor variables
clim2000$Summer = as.factor(clim2000$Summer)
clim2000$Year_fac = as.factor(clim2000$Year)
clim2000$Month_fac = as.factor(clim2000$Month)

A frequent source of errors is an earlier figure definition which is
still active. A good start for each test is therefore to switch off and
reset the old device settings with
dev.off()
From now on we will not explain all options of a command; please
refer to the help function for more information, e.g.
help(max) or
?max

Murrell, Paul, 2006: R Graphics, Computer Science and Data


Analysis Series, Chapman & Hall/CRC, 291p
Tukey, John Wilder, 1977: Exploratory Data Analysis, Addison-Wesley, Reading, MA
Tufte, Edward (1983), The Visual Display of Quantitative Information,
Graphics Press.
Engineering Statistics Handbook: Exploratory Data Analysis, online
textbook, http://www.itl.nist.gov/div898/handbook/eda/eda.htm

4.1 Simple numeric analyses


The obvious method to get a summary of the whole data set is
summary(Climate)
If you want a first overview, a pivot table is always a good choice if you work with
spreadsheets. With R you can get a table with monthly and annual sums with
xtabs(Climate$Prec ~ Climate$Year + Climate$Month)
The syntax of this command, especially the selection of the variables, is quite typical for
many other functions. Prec ~ Year + Month means: analyze the data variable Prec and
classify it by the year and month variables.
You can calculate the same result with the dplyr library (the last step uses dcast from the reshape2 package):
mgroup=group_by(Climate,Year,Month)
msum=summarize(mgroup, sum=sum(Prec))
msum=data.frame(msum)
mmonth=dcast(msum,Year~Month)

4.2 Simple graphic methods


The best plots for an overview of a data set are a scatterplot and a boxplot. If you want an
overview of the monthly mean temperature you can type (old style)
boxplot(Climate$AirTemp_Mean ~ as.factor(Climate$Month), ylab="Temp.")
or use ggplot2
qplot(as.factor(Month),AirTemp_Mean,data=Climate,geom="boxplot")
It is also never a bad idea to plot a histogram with a frequency distribution
hist(Climate$Prec)
qplot(Prec,data=Climate,geom="histogram")
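qplot chooses a default bin width and usually prints a message about it; if the default does not suit your data you can set it explicitly (the value 1 here is only an example):
qplot(Prec, data=Climate, geom="histogram", binwidth=1)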

The lattice library contains a lot of useful chart types, e.g. dotplots

library(lattice)
dotplot(AirTemp_Mean ~ Year_fac, data=clim2000)
# also available in ggplot2
qplot(Year_fac,AirTemp_Mean,data=clim2000)
qplot(Year_fac,AirTemp_Mean,data=clim2000,geom="jitter")

A nice version of a boxplot is a violin plot
library(vioplot)
with(clim2000, vioplot(AirTemp_Mean, AirTemp_Max, AirTemp_Min,
     names=c("Mean","Max","Min")))
library(violinmplot)
violinmplot( Year_fac ~ Prec, data=clim2000 )
violinmplot( Year_fac ~ AirTemp_Mean, data=clim2000 )
There is also a ggplot2 version
qplot(data=clim2000,x=Year_fac,y=AirTemp_Mean,
geom="violin")

4.3 Line and scatter plots


Type in the following commands and watch how the figure changes
attach(clim2000)
plot(AirTemp_Mean)
plot(AirTemp_Mean, type="l")
plot(AirTemp_Mean, type="l", col="red")
plot(AirTemp_Mean ~ Date, type="l", col="red")
plot(AirTemp_Mean ~ Date, type="l", col="red",
ylab="Temperature", xlab="Day")

A scatterplot is a version of a line plot with symbols instead of lines. It is a very common
type used for later regression analysis. For a ggplot2 version of these figures see section 4.6.1.
plot(AirTemp_Max, AirTemp_Min)
abline(0,1)
abline(0,0)
abline(lm(AirTemp_Min ~ AirTemp_Max), col="red")
lines(AirTemp_Min,AirTemp_Mean,col="green", type="p")
abline(lm(AirTemp_Max ~ AirTemp_Min), col="green")

There are also some new packages with more advanced functions. Try e.g.
library(car)
scatterplot(AirTemp_Max ~ AirTemp_Min | Year_fac)

For really big data sets the following functions can be quite useful
library(IDPmisc)
iplot(AirTemp_Min, AirTemp_Max)
or
library(hexbin)
bin = hexbin( AirTemp_Min, AirTemp_Max,xbins=50)
plot(bin)
or
with(Climate,smoothScatter( AirTemp_Mean,AirTemp_Max))
or with ggplot2
qplot(data=Climate,AirTemp_Mean,AirTemp_Max,
geom="bin2d")
qplot(data=Climate,AirTemp_Mean,AirTemp_Max,
geom="hex")
If you do not like the boring blue colours, you can change them to rainbow patterns
qplot(data=Climate,AirTemp_Mean,AirTemp_Max)+
stat_bin2d(bins = 200)+
scale_fill_gradientn(limits=c(0,50), breaks=seq(0, 40, by=10),
colours=rainbow(4))
plot() opens a new figure
lines() adds a line to an existing figure
abline() draws a straight line

10: p tim sy
11: p reshap
12:
13: ca function.
14: comp

4.4 Combined figures
For more complex and combined figures there are basically two choices in R: an easy-to-understand
matrix approach where all subfigures have the same size, and a more complex approach where you
can place your figures freely on a grid.

4.4.1 Figure matrix with par(mfrow)
For a figure with 4 elements (2 rows, 2 columns) we write
par(mfrow=c(2,2))
plot(AirTemp_Mean ~ Date, type="l", col="red", main="Fig 1")
plot(AirTemp_Max ~ Date, type="l", col="red", main="Fig 2")
plot(Prec ~ Date, type="l", col="red", main="Fig 3")
plot(Hum_Rel ~ Date, type="l", col="red", main="Fig 4")
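Keep in mind that the 2x2 arrangement stays active for all following plots; to return to a single plot per page you can reset the parameter (or simply close the device with dev.off()):
par(mfrow=c(1,1))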

4.4.2 Nested figures with split.screen()
A similar effect is produced with
split.screen(c (2, 2) )
screen(3)
plot(Prec ~ Date, type="l", col="red", main="Fig 3")
screen(1)
plot(AirTemp_Mean ~ Date, type="l", col="red", main="Fig 1")
screen(4)
plot(Hum_Rel ~ Date, type="l", col="red", main="Fig 4")

Here, screens can be addressed separately by their numbers. It is also possible to nest
screens: screen 2 is split into one row and two columns, which get the screen numbers 5 and 6.
split.screen( figs = c( 1, 2 ), screen = 2 )
screen(5)
plot(Prec ~ Date, type="l", col="red", main="Fig 5 inside 2")
screen(6)
plot(Sunshine ~ Date, type="l", col="red", main="Fig 6 inside 2")
close.screen(all=TRUE)
The result should look like Fig. 20.

Figure 20: Combining figures with the split.screen() command

4.4.3 Free definition of figure regions with layout()


The most complicated, but also the most flexible definition of combined figures is the
layout function.
The basic idea is shown in Fig. 21. The command
layout(matrix(c(1,1,1,2,2,0,3,0,0), 3, 3, byrow = TRUE))
defines a 3x3 matrix with 9 elements. The matrix command assigns each cell of this
matrix to a figure. Thus, the first three elements (the first line) of the matrix are assigned to
(sub-)figure 1. The second line contains subfigure 2 in two elements, the last element is left
free (0). In line 3, only the first cell is assigned to figure 3. The system is shown in Fig. 21.
You can control the layout with
layout.show(3)
The figure is filled with
plot(AirTemp_Mean ~ Date, type="l", col="red", main="Fig 1")
plot(AirTemp_Max ~ Date, type="l", col="red", main="Fig 2")
plot(Prec ~ Date, type="l", col="red", main="Fig 3")
The results are shown in Fig. 22. For this course we kept the structure of the matrix quite
simple; you can use as many elements as you want and arrange the figures in any order.

Figure 22: Mean temperature (Fig 1), maximum temperature (Fig 2) and precipitation (Fig 3) plotted against date, arranged with the layout() command
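The cells of the layout matrix do not have to be of equal size; as a sketch, the widths and heights arguments change the relative proportions of the columns and rows (here the first column and the first row are twice as large):
layout(matrix(c(1,1,1,2,2,0,3,0,0), 3, 3, byrow = TRUE),
       widths = c(2,1,1), heights = c(2,1,1))
layout.show(3)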
http://gallery.r-enthusiasts.com/ (probably down)
A nice selection of what is possible with graphics functions of R
http://research.stowers-institute.org/efg/R/
Another selection with more basic figures

4.4.4 Presenting simulation results


As a kind of final example we show you how to create a very common figure in hydrological
modeling, as shown in Fig. 23: we compare the simulated and observed values and add
precipitation on top of the discharge.
Xiangxi <-
read.table("Xiangxi.txt", header=TRUE, sep=",", na.strings="NA",
dec=".", strip.white=TRUE)
attach(Xiangxi)
DateCal = as.POSIXct(DateCal)
par(mar=c(5,5,2,5))
plot(DateCal, QsimCal, ylab="Streamflow [m^3/s]", xlab="Date", type =
"l", col="green", ylim=(c(0,1200)))
lines(DateCal, QobsCal, col="black")
par(new=T)
plot(DateCal, Precip, xlab="", ylab="", col="red", type="n", axes=F,
ylim=rev(c(0,120)))
lines(DateCal, Precip, col="red",lty=3)
axis(4)
mtext("Rain (mm)", side=4, line=3 )
Figure 23: Common display of hydrological simulation results

Exercise 15: create a combined figure with …
1st row: …
2nd row: …
3rd row: two scatterplots with 1) min vs. max temperature and 2) the mean temperature vs. …

4.4.5 Combined figures with the lattice package


R has three different graphic systems: the base system, lattice and ggplot2. Up to now
we worked mainly with the base system which is very flexible. Another frequently used
system is the lattice library.
Deepayan Sarkar, 2008: Lattice: multivariate data visualization with R.
Springer Use R! Series

Lattice is very well suited for the display of data sets with many (factor) variables, but the
syntax is different from normal figures and the display is not very flexible.
library(lattice)
First, let us start with some descriptive figures.
densityplot( ~ AirTemp_Max | Month_fac , data=Climate)
histogram( ~ AirTemp_Max | Month_fac , data=clim2000)
histogram( ~ AirTemp_Max+AirTemp_Min | Month_fac , data=clim2000)
Please note how the numeric variables (temperatures) and the factor variables are ordered.
All examples above print monthly plots of a temperature.
Scatterplots are very similar, only the definition of the variables is different:
xyplot(AirTemp_Mean ~ AirTemp_Max+AirTemp_Min | Month_fac ,
data=clim2000)
A very useful keyword is the grouping inside a figure.
xyplot(AirTemp_Mean ~ AirTemp_Max+AirTemp_Min, groups=Summer,
data=clim2000)
Here you can clearly see the difference between summer and winter values.
Another useful feature is the automatic addition of a legend.
xyplot(AirTemp_Mean ~ AirTemp_Max+AirTemp_Min, groups=Summer,
auto.key=T, data=clim2000)
xyplot(AirTemp_Mean ~ AirTemp_Max+AirTemp_Min, groups=Summer,
auto.key=(list(title="Summer?")), data=clim2000)
A combination of all these simple features makes it easy to get an overview of the data set. In our
example it is quite apparent that something went wrong in the year 2007 (Fig. 24).
xyplot(AirTemp_Mean ~ AirTemp_Max+AirTemp_Min, groups=Year_fac,
auto.key=list(title="Year",columns=7), data=clim2000)
Figure 24: Scatterplot of air temperatures with annual grouping

4.5 Brushing up

4.5.1 Setting margins


The setting of the margins is a nightmare for the beginner because there are several
possibilities, none of which is easy to grasp.
To recall the default margin setting you type
par()$mar
for inner margins and
par()$oma
for outer margins. New margins are set by
par(mar=c(4, 4, 4, 4))
The numbers give the margins at the bottom, left, top and right (in this order).
The unit of these numbers is lines of text; margins can also be set in inches (there are
no metric units, though).
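For reference, the inch-based equivalent is par(mai=...), with the same bottom/left/top/right order (the values here are only an example):
par(mai=c(1.0, 0.8, 0.8, 0.4))   # margins in inches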
4.5.2 Text for title and axes
Adding text (labels) for axes and main title is quite straightforward
plot(AirTemp_Max, AirTemp_Min, ylab="Minimum Temperature [°C]",
xlab="Maximum Temperature [°C]", main="Temperature")
If you want additional explanation in a figure you can add text in the margins outside the plot
area with
mtext("Line 1", side=2, line=1, adj=1.0, cex=1, col="green")
or inside the plot region with
text(5,5, "Plot", col="red", cex=2)
The x/y coordinates of the text command are given in the units of the data set. As usual, you can
use any variable containing text, e.g. for automatic annotations.

4.5.3 Colors
Colors in all pictures can be referred to by number or text.
plot(AirTemp_Max, AirTemp_Min, col=2)
is the same as
plot(AirTemp_Max, AirTemp_Min, col="red")
A list of color names is printed with
colors()
If you want only shades of red
colors()[grep("red",colors())]

http://research.stowers-institute.org/efg/R/Color/Chart/
In depth information about colors in R and science

4.5.4 Legend
In the basic graphic system, legends are not added automatically; you have to define
them separately, e.g.
plot(AirTemp_Max, AirTemp_Min)
lines(AirTemp_Min,AirTemp_Mean,col="green", type="p")
legend(20,-10, c("Max/Min", "Min/Mean"), col = c("black","green"), lty
= c(0,0), lwd=c(1,2), pch=c("o","o"), bty="n",merge = TRUE, bg =
'white' )
locator(1) # get the coordinates of a position in the figure
Again, the x/y coordinates are given in the units of the data set. If you want to set the location with
the mouse you can use the following command
legend(locator(1), c("Max/Min", "Min/Mean"), col = c("black","green"),
lty = c(0,0), lwd=c(1,2), pch=c("o","o"), bty="n",merge = TRUE, bg =
'white' )
4.5.5 More than two axes
Each plot command sets the scales for the whole figure, so the next plot command would
create a new figure. To avoid this, you have to create a new reference system in the same
figure.
First, we need more space on the right side of the plot and set margins for the second Y-
axis.
par(mar=c(5,5,5,5))
plot(AirTemp_Mean ~ Date, type="l", col="red", yaxt="n", ylab="")
As the y-axis is not drawn (yaxt="n") , we do it manually
axis(2, pretty(c(min(AirTemp_Mean),max(AirTemp_Mean))), col="red")
and finally add a title for the left axis
mtext("Mean Temp", side=2, line=3, col="red")
Now comes the second data set. To avoid a new figure we need to set
par(new=T)
The next lines are quite similar, except that we draw the y-axis on the right side (4).
plot(Prec ~ Date, type="l", col="green", yaxt='n', ylab="")
axis(4, pretty(c(0,max(Prec))), col="green")
mtext("Precipitation", side=4, line=3, col="green")

4.6 Plots with ggplot2


Unfortunately, the best graphic system in R is also the most complicated. Because disasters
always strike twice, ggplot2 does not work with the old system for multiple figures like layout()
but requires you to learn a new system based on the grid package.

http://ggplot2.org/ The Website for the package


https://www.stat.auckland.ac.nz/~paul/grid/grid.html The website
explaining the grid layout package
http://shinyapps.stat.ubc.ca/r-graph-catalog/ A website with
example code and figures
Chang, W., 2012. R Graphics Cookbook. O'Reilly Media.
Wickham, H., 2009. ggplot2 - Elegant Graphics for Data Analysis. Springer, Use R! Series.

library(ggplot2)

4.6.1 Simple plots


We already used the ggplot2 library a few times: the qplot function is part of the
ggplot2 library. Originally, the library was programmed to implement the Grammar of
Graphics. As with all other grammars, the result is quite complex and difficult to understand.
This is why qplot was added: it facilitates the transition from the traditional graphic
subsystem to ggplot2. However, qplot has fewer options for many functions. If you want
to change details, you normally have to move to the original ggplot syntax.
Because there are a lot of good introductions to ggplot2, we limit the explanations in this book
to one example. If you want to know more details we recommend the book by the author,
Wickham (2009); it is available for download.
The following command plots an annual time series from our lake data set with a non-linear
(smoothed) regression line.
chem_all$Area_fac = as.factor(as.integer(chem_all$MeanDepth/2.5)*2.5)
qplot(Year,NO3.N,data=chem_all,geom=c("smooth", "point"),
facets= Area_fac~ .)
The following lines translate the command above to true ggplot. Please note that all of the
following code lines belong together and create a single figure. The + sign adds another
graphic component to the figure.
All figures start with a definition of the data source and the aesthetics (aes), i.e. the definition
of the axes.
ggplot(chem_all, aes(x=Year, y=NO3.N)) +
# defines how the data set is displayed: point
geom_point(color = "red", size = 3) +
# adds a statistic, in this case a linear regression line
stat_smooth(method="lm") +
# create sub-plots for area
facet_wrap(~Area_fac,scales="free") +
# changes the background to white
theme_bw()

Exercise 16: find out how to change the size and orientation of an x axis in ggplot2

4.6.2 Multiple plots with ggplot2, gridExtra version

library(ggplot2)
library(grid)
library(gridExtra)
p1 <- qplot(data=clim2000,x=Month_fac,facets= . ~ Year_fac,
y=AirTemp_Mean,geom="boxplot")
p2 <- qplot(data=clim2000,x=Month_fac,facets= . ~ Year_fac,
y=Prec,geom="boxplot")
p3 <- qplot(data=clim2000,x=Month_fac,facets= . ~ Year_fac,
y=Sunshine,geom="boxplot")
p4 = qplot(data=clim2000,x=Date,y=AirTemp_Mean,geom="line")
grid.arrange(p1, p2, p3, p4, ncol = 2, main = "Main title")
dev.off()
The multiplot function delivers nearly the same result.
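Note that newer versions of the gridExtra package renamed the title argument; if the call above complains about main, the following variant should work:
grid.arrange(p1, p2, p3, p4, ncol = 2, top = "Main title")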

4.6.3 Multiple plots with ggplot2, viewport version


library(grid)
We start with a new page
grid.newpage()
Grid needs so-called viewports. You can use any area; here we define the lower left quarter of
the page.

### define first plotting region (viewport)


vp1 <- viewport(x = 0, y = 0, height = 0.5, width = 0.5,
just = c("left", "bottom"), name = "lower left")
From now on, everything is drawn in the lower left part
pushViewport(vp1)
### show the plotting region (viewport extent)
### plot a plot - needs to be printed (and newpage set to FALSE)!!!
Now we define the figure. The qplot command is a simplification of the ggplot2 package;
it makes the transition from older packages easier. A stored plot always requires an extra print
command to appear on the page. Now we print monthly boxplots in a separate panel for each year.

bw.lattice <- qplot(data=clim2000, x=Month_fac, facets= . ~ Year_fac,
                    y=AirTemp_Mean, geom="boxplot")
print(bw.lattice, newpage= FALSE)

Now we move up one step in the hierarchy, all plot commands would now be printed on the
full page.
upViewport(1)
### define second plot area
vp2 <- viewport(x = 1, y = 0, height = 0.5, width = 0.5,
just = c("right", "bottom"), name = "lower right")
### enter vp2
pushViewport(vp2)
### show the plotting region (viewport extent)
### plot another plot
bw.lattice <- qplot(data=clim2000,x=Month_fac,
facets= . ~ Year_fac,y=Prec,geom="boxplot")
print(bw.lattice, newpage= FALSE)
### leave vp2
upViewport(1)
vp3 <- viewport(x = 0, y = 1, height = 0.5, width = 0.5,
just = c("left", "top"), name = "upper left")
pushViewport(vp3)
bw.lattice <- qplot(data=clim2000,x=Month_fac,
facets= . ~ Year_fac, y=Sunshine,geom="boxplot")

print(bw.lattice, newpage= FALSE)


### show the plotting region (viewport extent)
upViewport(1)
vp4 <- viewport(x = 1, y = 1, height = 0.5, width = 0.5,
just = c("right", "top"), name = "upper right")
pushViewport(vp4)
bw.lattice=qplot(data=clim2000,x=Date,y=AirTemp_Mean,geom="line")
print(bw.lattice, newpage= FALSE)
upViewport(1)

4.7 Saving figures
The easiest way to save figures produced with R is to copy them via the clipboard (copy
and paste) directly into your text or presentation, or to save them with File -> Save as to
an image file. However, if you have more than one image or if you have to produce the same image
over and over, it is better to save the figures automatically to a file. You can save figures in
different formats; below we show the commands to open a file in PDF, PNG and JPG format,
respectively.
pdf(file = "FDC.pdf", width=5, height=4, pointsize=1);
png("acf_catments.png", width=900) # dim in pixels
jpeg("test.jpg",width=600,height=300)
plot(AirTemp_Mean ~ Date, type="l", col="red", main="Fig 1")
All graphic devices are closed with the command
dev.off()
The recommended procedure is to develop and test a figure on screen and wrap it in a file
as soon as the results are as expected.
For the ggplot2 library you have to use
fig1 <- qplot(data=clim2000,x=Month_fac,y=AirTemp_Mean,geom="boxplot")
ggsave("fig1.png", width=3, height=3) # width/height in inches unless units is set

4.8 Scatterplot matrix plots


Scatterplot matrices belong to the bivariate methods that are discussed in more detail in chapter
5. However, they are frequently used in EDA for a short but detailed overview of data sets
with correlations. A good example is our data set of lake chemistry: it is very probable that
some substances are correlated. This method is also a good way to identify outliers and
extreme values in data sets.
As always in R, there are three ways to heaven. Because they all have different unique
features, we will introduce all three here. For the pairs() example you also need the panel
functions printed below the proper command lines.
library(car)
library(lattice)
library(GGally)
We start with the grandfather of all scatterplot matrices
splom(chemie)
splom(chemie,groups=as.factor(chemie$Year))

The GGally package (built on top of ggplot2) also contains a scatterplot matrix function
ggpairs(chemie, columns = 5:8)
t=chem_all[,c(6:9,35)]

ggpairs(t,columns=1:3)
You can also integrate density plots in ggpairs
ggpairs(t,columns=1:3,
upper = list(continuous = "density"),
lower = list(combo = "facetdensity"))
One of the most useful scatterplot versions is pairs, which (with the panel functions below) also
prints the correlation and the significance level. Unfortunately, pairs does not work with missing
values in the data set, so we first remove all incomplete rows; this cleaning process often removes
half of the data set.
t2=t[complete.cases(t),]
pairs(t2[1:3], lower.panel=panel.smooth, upper.panel=panel.plot)

t2=t[complete.cases(t),]
pairs(t2, lower.panel=panel.smooth, upper.panel=panel.cor)

The following code is the definition of functions needed by pairs. You have to execute
them prior to the use of pairs.
panel.cor <- function(x, y, digits=2, prefix="", cex.cor)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits=digits)[1]
txt <- paste(prefix, txt, sep="")
cex <- if(missing(cex.cor)) 0.8/strwidth(txt) else cex.cor
test <- cor.test(x,y)
# borrowed from printCoefmat
Signif <- symnum(test$p.value, corr = FALSE, na = FALSE,
cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
symbols = c("***", "**", "*", ".", " "))
text(0.5, 0.5, txt, cex = cex * r)
text(.8, .8, Signif, cex=cex, col=2)
}

# based mostly on http://gallery.r-enthusiasts.com/RGraphGallery.php?graph=137
panel.plot <- function(x, y) {
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
ct <- cor.test(x,y)
sig <- symnum(ct$p.value, corr = FALSE, na = FALSE,
cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
symbols = c("***", "**", "*", ".", " "))
r <- ct$estimate
rt <- format(r, digits=2)[1]
cex <- 0.5/strwidth(rt)

text(.5, .5, rt, cex=cex * abs(r))


text(.8, .8, sig, cex=cex, col='blue')
}
panel.smooth <- function (x, y) {
points(x, y)
abline(lm(y~x), col="red")
lines(stats::lowess(y~x), col="blue")
}

4.9 3d Images
Plotting 3-D images is no problem if you already have a grid with regular spacing. The
procedure shown here also works with irregularly spaced data, but keep in mind that the spatial
interpolation may cause two kinds of problems:
valleys and/or mountains may appear in the image which are not found in the data;
information at a smaller scale than the grid size may completely disappear in the image.
For the spatial interpolation we use the package akima
install.packages("akima")
library(akima)
The data set is not a real spatial data set but a time series of soil water content at different
depth. The third dimension here is time.
g <- read.csv("soil_water.csv", header=TRUE)
attach(g)
We define the range of the output grid
x0 <- -180:0
y0 <- 0:367
ak <- interp(g$Depth, g$Day, g$SWC, xo=x0, yo=y0)
The ranges can also be derived automatically from the data:
x0 <- min(Depth):max(Depth)
y0 <- min(Day):max(Day)
The variable ak now contains a regular grid and we can plot all kinds of impressive
three-dimensional figures. We start with a boring contour plot:
contour(ak$x, ak$y, ak$z)
A more colourful version codes the values of the z-column with all colours of the rainbow:
image(ak$x, ak$y, ak$z, col=rainbow(50))
A similar picture comes out from
filled.contour(ak$x, ak$y, ak$z, col=rainbow(50))
If you don't like the colours of the rainbow you can also use:
'heat.colors','topo.colors','terrain.colors', 'rainbow', 'hsv', 'par'.
A 3-D view is created by:
persp(ak$x, ak$y, ak$z, expand=0.25, theta=60, phi=30, xlab="Depth",
ylab="Day", zlab="SWC", ticktype="detailed", col="lightblue")
5 Bivariate statistics
This chapter is based on the following literature sources:
Kabacoff, R.I., 2011. R in Action - Data Analysis and Graphics with R. Manning Publications
Co., Shelter Island, NY.
http://www.manning.com/kabacoff/
Logan, M., 2010. Biostatistical Design and Analysis Using R: A Practical Guide. Wiley-
Blackwell, Chichester, West Sussex.
http://eu.wiley.com/WileyCDA/WileyTitle/productCd-1405190086.html
Trauth, M.H., 2006. MATLAB recipes for earth sciences. Springer, Berlin Heidelberg New
York.
http://www.springer.com/earth+sciences+and+geography/book/978-3-642-12761-8?
changeHeader

Bivariate analysis aims to understand the relationship between two variables x and y.
Examples are
the length and the width of a fossil
the sodium and potassium content of volcanic glass
the organic matter content along a sediment core

When the two variables are measured on the same object, x is usually identified as the
independent variable, whereas y is the dependent variable. If both variables were generated
in an experiment, the variable manipulated by the experimenter is described as the
independent variable. In some cases, both variables are not manipulated and therefore
independent. The methods of bivariate statistics help describe the strength of the
relationship between the two variables, either by a single parameter such as Pearson's
correlation coefficient for linear relationships or by an equation obtained by regression
analysis (Fig. 5-1). The equation describing the relationship between x and y can be used to
predict the y-response from arbitrary x values within the range of original data values used for
regression. This is of particular importance if one of the two parameters is difficult to
measure. Here, the relationship between the two variables is first determined by regression
analysis on a small training set of data. Then, the regression equation is used to calculate
this parameter from the first variable.

Correlation or Regression ?!
Correlation: Neither variable has been set (they are both measured) AND there is no implied
causality between the variables
Regression: Either one of the variables has been specifically set (not measured) OR there is an
implied causality between the variables whereby one variable could influence the other but the
reverse is unlikely.
The thirty data points represent the age of a sediment (in kiloyears before present) at a certain depth (in
meters) below the sediment-water interface. The joint distribution of the two variables suggests a linear
relationship between age and depth, i.e., the increase of the sediment age with depth is constant. Pearson's
correlation coefficient (explained in the text) of r = 0.96 supports the strong linear dependency of the two
variables. Linear regression yields the equation age = 6.6 + 5.1 · depth. This equation indicates an increase of the
sediment age of 5.1 kyrs per meter of sediment depth (the slope of the regression line). The inverse of the slope
is the sedimentation rate of ca. 0.2 meters/kyr. Furthermore, the equation defines the age of the sediment
surface of 6.6 kyrs (the intercept of the regression line with the y-axis). The deviation of the surface age from
zero can be attributed either to the statistical uncertainty of the regression or to a natural process such as
erosion or bioturbation. Whereas the assessment of the statistical uncertainty will be discussed in this
chapter, the latter needs a careful evaluation of the various processes at the sediment-water interface.

5.1 Pearson's correlation coefficient

Correlation coefficients are often used at the exploration stage of bivariate statistics. They are
only a very rough estimate of a (recti-)linear trend in the bivariate data set. Unfortunately,
the literature is full of examples where the importance of correlation coefficients is
overestimated and outliers in the data set lead to an extremely biased estimator of the
population correlation coefficient. The most popular correlation coefficient is Pearson's
linear product-moment correlation coefficient (Fig. 5-1). We estimate the population's
correlation coefficient from the sample data, i.e., we compute the sample correlation
coefficient r, which is defined as
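(written here in the standard notation, with $\bar{x}$ and $\bar{y}$ denoting the sample means)

$$ r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{(n-1)\,s_x\,s_y} $$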
where n is the number of (x, y) pairs of data points and sx and sy are the univariate standard
deviations. The numerator of Pearson's correlation coefficient is known as the corrected sum
of products of the bivariate data set. Dividing the numerator by (n − 1) yields the covariance,
which is the summed products of deviations of the data from the sample means, divided by
(n − 1). The covariance is a widely used measure in bivariate statistics, although it has the
disadvantage of depending on the dimension of the data.

Dividing the covariance by the univariate standard deviations removes this effect and leads
to Pearson's correlation coefficient r.

Pearson's correlation coefficient is very sensitive to various disturbances in the bivariate


data set. The following example illustrates the use of the correlation coefficients and
highlights the potential pitfalls when using this measure of linear trends. It also describes
the resampling methods that can be used to explore the confidence of the estimate for r.

The dataset:
The synthetic data consist of two variables, the age of a sediment in kiloyears before
present and the depth below the sediment-water interface in meters. The use of synthetic
data sets has the advantage that we fully understand the linear model behind the data.
The data are represented as two columns contained in file agedepth.txt. These data have
been generated using a series of thirty random levels (in meters) below the sediment
surface. The linear relationship age = 5.6 · depth + 1.2 was used to compute noise-free values
for the variable age. This is the equation of a straight line with a slope of 5.6 and an
intercept with the y-axis of 1.2. Finally, some Gaussian noise of amplitude 10 was added to
the age data.

We load the data from the file agedepth.txt using the import function of Rstudio
(separator white space, decimal .)

Exercise 17: plot age (x-axis) against depth (y-axis)
Exercise 18: assess linearity and bivariate normality using a scatterplot with marginal boxplots

plot(x, y) opens a new figure
library() opens/loads a specific package of R
help(name) opens the help in R with the information on a function or package
(a, b) Positive and negative linear correlation, (c) random scatter without a linear
correlation, (d) an outlier causing a misleading value of r, (e) curvilinear relationship
causing a high r since the curve is close to a straight line, (f) curvilinear relationship clearly
not described by r.
Observation to exercise 17:
We observe a strong linear trend suggesting some dependency between the variables,
depth and age. This trend can be described by Pearson's correlation coefficient r, where r = 1
represents a perfect positive correlation, i.e., age increases with depth, r = 0 suggests no
correlation, and r = −1 indicates a perfect negative correlation.
We use the function cor.test to compute Pearson's correlation coefficient:

cor.test(~depth + age, data=agedepth)

The cor.test command has the form cor.test (~y+x, data=dataset).


If you attach a dataset, you can also use the form cor.test (x,y):
attach(agedepth)
cor.test(age, depth)
detach(agedepth)

The program's output looks like this:

Pearson's product-moment correlation


data: depth and age
t = 13.8535, df = 28, p-value = 4.685e-14
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8650355 0.9684927
sample estimates:
cor
0.9341736

The value of r = 0.9342 suggests that the two variables age and depth depend on each other.

However, Pearson's correlation coefficient is highly sensitive to outliers. This can be


illustrated by the following example. Let us generate a normally-distributed cluster of thirty
(x,y) data with zero mean and standard deviation one.

x=rnorm(30, mean=0, sd=1)


y=rnorm(30, mean=0, sd=1)
plot(x,y)

As expected, the correlation coefficient of these random data is very low.

cor.test(~y + x)

Pearson's product-moment correlation


data: y and x
t = -0.5961, df = 28, p-value = 0.5559
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.4539087 0.2587592
sample estimates:
cor
-0.111946
Now we introduce one single outlier to the data set, an exceptionally high (x,y) value, which
is located precisely on the 1:1 line. The correlation coefficient for the bivariate data
set including the outlier (x,y) = (5,5) is much higher than before.

x[31]=5
y[31]=5
plot(x,y)
cor.test(~y + x)
abline(lm(y ~ x), col="red")

Pearson's product-moment correlation


sample estimates:
cor
0.3777136

After increasing the absolute (x,y) values of this outlier, the correlation coefficient increases
dramatically.

x[31]=10
y[31]=10
plot(x,y)
cor.test(~y + x)
abline(lm(y ~ x), col="red")

Pearson's product-moment correlation


sample estimates:
cor
0.7266505

Still, the bivariate data set does not provide much evidence for a strong dependence.
However, the combination of the random bivariate (x,y) data with one single outlier results
in a dramatic increase of the correlation coefficient. Whereas outliers are easy to identify in
a bivariate scatter, erroneous values might be overlooked in large multivariate data sets.

abline(lm(y ~ x)) uses the basic graphics function abline() to add a regression trend line
based on the linear model (lm) of y against x.

Pachygrapsus crassipes (striped shore crab)

The dataset:
Sokal and Rohlf (1997) present an unpublished data set (L. Miller) in which the correlation
between gill weight and body weight of the crab (Pachygrapsus crassipes) is investigated.

Exercise 19:
a) import the crab data set (crab.csv, separator ",")
b) assess linearity and bivariate normality using a scatterplot with marginal boxplots
c) calculate the Pearson's correlation coefficient and test H0: ρ = 0 (that the population
correlation coefficient equals zero)

5.2 Correlograms

Correlation matrices are a fundamental aspect of multivariate statistics. Which variables


under consideration are strongly related to each other and which aren't? Are there clusters
of variables that relate in specific ways? As the number of variables grows, such questions
become harder to answer. Correlograms are a relatively recent tool for visualizing the data in
correlation matrices.

The dataset:
It's easier to explain a correlogram once you've seen one. Consider the correlations among
the variables in the PMM data set. Here you have 15 variables, namely different chemical
elements measured during a XRF (X-Ray fluorescence) scan of a sediment core taken from
an Alpine peat bog (Plan da Mattun Moor, PMM) in 2010. The core had a length of 143 cm
and was scanned in 1-cm-resolution. Hence, 143 data values are available for each element.

We load the data from the file PMM.txt using the import function of RStudio (separator
white space)
You can get the correlations using the following code:

options(digits=2)
cor(PMM)

Al Si S Cl K Ca Ti Mn Fe Zn Br Rb
Al 1,0000 0,9073 -0,1005 -0,0740 0,9167 -0,2974 0,9176 0,1688 0,4443 0,4805 -0,3899 0,7344
Si 0,9073 1,0000 -0,2246 -0,2298 0,7381 -0,5379 0,8063 0,1783 0,2953 0,2885 -0,5806 0,4900
S -0,1005 -0,2246 1,0000 0,1782 -0,1026 0,5458 -0,1194 0,0003 0,5452 -0,0443 0,1718 -0,0904
Cl -0,0740 -0,2298 0,1782 1,0000 0,0828 0,5761 0,0331 0,3657 0,1702 0,3451 0,4423 0,2326
K 0,9167 0,7381 -0,1026 0,0828 1,0000 -0,1509 0,9482 0,1429 0,4794 0,6092 -0,2269 0,8973
Ca -0,2974 -0,5379 0,5458 0,5761 -0,1509 1,0000 -0,3021 0,0238 0,0442 0,0480 0,6203 0,0523
Ti 0,9176 0,8063 -0,1194 0,0331 0,9482 -0,3021 1,0000 0,1917 0,5605 0,5442 -0,3700 0,8425
Mn 0,1688 0,1783 0,0003 0,3657 0,1429 0,0238 0,1917 1,0000 0,2784 0,4044 0,0644 0,1225
Fe 0,4443 0,2953 0,5452 0,1702 0,4794 0,0442 0,5605 0,2784 1,0000 0,3898 -0,0458 0,4875
Zn 0,4805 0,2885 -0,0443 0,3451 0,6092 0,0480 0,5442 0,4044 0,3898 1,0000 0,1412 0,6466
Br -0,3899 -0,5806 0,1718 0,4423 -0,2269 0,6203 -0,3700 0,0644 -0,0458 0,1412 1,0000 -0,0314
Rb 0,7344 0,4900 -0,0904 0,2326 0,8973 0,0523 0,8425 0,1225 0,4875 0,6466 -0,0314 1,0000
Which variables are most related?
Which variables are relatively independent?
Are there any patterns?
It isn't that easy to tell from the correlation matrix without significant time and effort (and
probably a set of colored pens to make notations). You can display that same correlation
matrix using the corrgram() function in the corrgram package.

You have to install the corrgram package first!

library(corrgram)
corrgram(PMM)

Figure 5-3: Correlogram of the correlations among the variables in the PMM data frame

To interpret this graph (Fig. 5-3), start with the lower triangle of cells (the cells below the
principal diagonal). By default, a blue color and hashing that goes from lower left to upper
right represents a positive correlation between the two variables that meet at that cell.
Conversely, a red color and hashing that goes from the upper left to the lower right
represents a negative correlation. The darker and more saturated the color, the greater the
magnitude of the correlation. Weak correlations, near zero, will appear washed out.

The format of the corrgram() function is:


corrgram(x, order=, panel=, text.panel=, diag.panel=)

where x is a data frame with one observation per row. When order=TRUE, the variables are
reordered using a principal component analysis of the correlation matrix. Reordering can
help make patterns of bivariate relationships more obvious. The option panel specifies the
type of off-diagonal panels to use. Alternatively, you can use the options lower.panel and
upper.panel to choose different options below and above the main diagonal. The text.panel
and diag.panel options refer to the main diagonal.

Figure 5-4: Different correlogram layouts using the corrgram package with the variables in the PMM data frame

The dataset:
Equivalent to the PMM data set, the STY1 data set consists of geochemical data produced
by XRF scanning of a sediment core from Lake Stymphalia in Greece (Unkel et al., 2011).
Reference:
Unkel, I., Heymann, C., Nelle, O., Zagana, H., 2011. Climatic influence on Lake Stymphalia
during the last 15 000 years, In: Lambrakis, N., Stournaras, G., Katsanou, K. (Eds.), Advances
in the Research of Aquatic Environment. Springer, Berlin, Heidelberg, pp. 75-82.
Exercise 20:
a) import the STY1 data set (STY1.txt, separator=Tab)
b) plot the following element-combinations in a nested (multi-plot) figure of 4 plots,
add a title (main) and a regression line (abline) and different color in each
respective plot:
Al-Si; Ca-Sr; Ca-Si; and Mn-Fe
explain what you see.
c) calculate the Pearson's correlation coefficient and test H0: ρ = 0 (that the population
correlation coefficient equals zero) for these four element combinations
d) produce first an unsorted and then a sorted (order=TRUE) correlation matrix of the
entire STY1 data set, both times only displaying the lower panel as shades.

5.3 Classical linear regression
Linear regression provides another way of describing the dependence between the two
variables x and y. Whereas Pearson's correlation coefficient provides only a rough measure
of a linear trend, linear models obtained by regression analysis allow us to predict arbitrary y
values for any given value of x within the data range. Statistical testing of the significance of
the linear model provides some insight into the quality of the prediction. Classical regression
assumes that y responds to x, and that the entire dispersion in the data set is in the y-values (Fig.
5-5). Then, x is the independent, regressor or predictor variable. The values of x are defined
by the experimenter and are often regarded as free of errors. An example is the
location x of a sample in a sediment core. The dependent variable y contains errors, as its
magnitude cannot be determined accurately. Linear regression minimizes the y deviations
between the xy data points and the values predicted by the best-fit line using a least-squares
criterion. The basic equation for a general linear model is
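(written in standard notation, with the error term omitted)

$$ y = b_0 + b_1\,x $$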

The regression line passes through the data centroid defined by the sample means. We can
therefore compute the second regression coefficient b0
from the univariate sample means and the slope b1:
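(the standard least-squares estimates, with $\bar{x}$ and $\bar{y}$ denoting the sample means)

$$ b_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^{2}}, \qquad b_0 = \bar{y} - b_1\,\bar{x} $$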

Whereas classical regression minimizes the Δy deviations, reduced major axis
regression minimizes the triangular area 0.5 (Δx · Δy) between the points and the
regression line, where Δx and Δy are the distances between the predicted and the
true x and y values. The intercept of the line with the y-axis is b0, whereas the slope
is b1. These two parameters define the equation of the regression line.

5.3.1 Analyzing the Residuals


When you compare how far the predicted values are from the actual or observed values,
you are performing an analysis of the residuals. The statistics of the residuals provides
valuable information on the quality of a model fitted to the data. For instance, a significant
trend in the residuals suggests that the model does not fully describe the data. In such a case, a
more complex model, such as a polynomial of a higher degree, should be fitted to the data.
Ideally, the residuals are purely random, i.e., Gaussian distributed with zero mean. Therefore,
we can test the hypothesis that our residuals are Gaussian distributed by visual inspection
of the histogram and by employing a χ²-test, introduced later (chapter 6).
Assessing the residual plot in R:
dataset.lm <- lm(y ~ x, dataset)
plot(dataset.lm, which = 1)
The dataset:
As part of a Ph.D. into the effect of starvation and humidity on water loss in the confused
flour beetle (Tribolium confusum), Nelson (1964) investigated the linear relationship between
humidity and water loss by measuring the amount of water loss (mg) by nine batches of
beetles kept at different relative humidities (ranging from 0 to 93%) for a period of six days
(in: Sokal, R. and Rohlf, F.J. (1997). Biometry, 3rd edition. W.H. Freeman, San Francisco.)

Exercise 21:
a) import the Nelson data set (nelson.csv, separator=, )
b) assess linearity and bivariate normality using a scatterplot with marginal boxplots
comment: the ordinary least squares method is considered appropriate, as there is
effectively no uncertainty (error) in the predictor variable (x-values, relative
humidity)
c) fit the simple linear regression model (y = b0 + b1·x) and examine the diagnostics:
nelson.lm <- lm(WEIGHTLOSS~HUMIDITY, nelson)
plot(nelson.lm)
(Table from Logan (2011), figure from Trauth (2006))
mean: The most popular indicator of central tendency is the arithmetic mean, which is
the sum of all data points divided by the number of observations

median: the median is often used as an alternative measure of central tendency. The
median is the x-value which is in the middle of the data, i.e., 50% of the
observations are larger than the median and 50% are smaller. The median of a
data set sorted in ascending order is defined as

$\tilde{x} = \tfrac{1}{2}\,(x_{N/2} + x_{N/2+1})$ if N is even, $\tilde{x} = x_{(N+1)/2}$ if N is odd

Quantiles are a more general way of dividing the data sample into groups containing equal
numbers of observations. For example, quartiles divide the data into four groups,
quintiles divide the observations in five groups and percentiles define one hundred
groups.

degrees of freedom : the number of values in a distribution that are free to be varied.
Null hypothesis:
A biological or research hypothesis is a concise statement about the predicted or theorized
nature of a population or populations and usually proposes that there is an effect of a
treatment (e.g. the means of two populations are different). Logically however, theories
(and thus hypothesis) cannot be proved, only disproved (falsification) and thus a null
hypothesis (Ho) is formulated to represent all possibilities except the hypothesized
prediction. For example, if the hypothesis is that there is a difference between (or
relationship among) populations, then the null hypothesis is that there is no difference or
relationship (effect). Evidence against the null hypothesis thereby provides evidence that
the hypothesis is likely to be true. The next step in hypothesis testing is to decide on an
appropriate statistic that describes the nature of population estimates in the context of the
null hypothesis taking into account the precision of estimates. For example, if the null
hypothesis is that the mean of one population is different to the mean ofanother
population, the null hypothesis is that the population means are equal. The null hypothesis
can therefore be represented mathematically as: Ho: 1=2 or equivalently: Ho 1-2=0.
6 Univariate statistics
6.1 Student's t-test
(Chapter text based on Trauth, 2006)
The Student's t distribution was first introduced by William Gosset (1876-1937), who needed a
distribution for small samples (Fig. 6-1). W. Gosset was an employee of the Irish Guinness Brewery
and was not allowed to publish research results. For that reason he published his
t distribution under the pseudonym "Student" (Student, 1908). The probability density
function is
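(in standard notation, with ν the degrees of freedom and Γ the gamma function)

$$ f(t) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)}\left(1+\frac{t^{2}}{\nu}\right)^{-\frac{\nu+1}{2}} $$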

The single parameter of the t distribution is the number of degrees of freedom ν. In the analysis of
univariate data, this parameter is ν = n − 1, where n is the sample size. As ν → ∞, the t
distribution converges to the standard normal distribution. Since the t distribution
approaches the normal distribution for ν > 30, it is not often used for distribution fitting.
However, the t distribution is used for hypothesis testing, namely the t-test.
Student's t-test by Gosset compares the means of two distributions.
Let us assume that two independent sets of na and nb measurements have been carried
out on the same object. For instance, several samples were taken from two different
outcrops. The t-test can be used to test the hypothesis that both samples come from the
same population, e.g., the same lithologic unit (null hypothesis), or from two different
populations (alternative hypothesis). Both the sample and the population distribution have to be
Gaussian. The variances of the two sets of measurements should be similar. Then, the
proper test statistic for the difference of two means is
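(in the standard pooled-variance form, with $\bar{x}_a$ and $\bar{x}_b$ the two sample means)

$$ t = \frac{\bar{x}_a - \bar{x}_b}{\sqrt{\dfrac{(n_a-1)s_a^{2}+(n_b-1)s_b^{2}}{n_a+n_b-2}\left(\dfrac{1}{n_a}+\dfrac{1}{n_b}\right)}} $$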

where na and nb are the sample sizes and sa² and sb² are the variances of the two samples a
and b. The alternative hypothesis can be rejected if the measured t-value is lower than the critical
t-value, which depends on the degrees of freedom ν = na + nb − 2 and the significance level
α. If this is the case, we cannot reject the null hypothesis without another cause. The
significance level α of a test is the maximum probability of accidentally rejecting a true null
hypothesis. Note that we cannot prove the null hypothesis; in other words, "not guilty" is not
the same as "innocent".

The dataset example 1 (Logan page 142, 6A)


Ward and Quinn (1988) investigated differences in the fecundity (fertility, as measured by
egg production) of a predatory intertidal gastropod (Lepsiella vinosa) in two different
intertidal zones (mussel zone and the higher littorinid zone). (Ward, S., Quinn, G.P., 1988.
Preliminary investigations of the ecology of the intertidal predatory gastropod Lepsiella
vinosa (Lamarck) (Gastropoda Muricidae). Journal of Molluscan Studies 54, 109-117,
doi:10.1093/mollus/54.1.109.)

Exercise 22:
a) import the Ward data set (ward.csv, separator=, )
b) We then assess the assumptions of normality and homogeneity of variance for the null
hypothesis that the population mean egg production is the same for both littorinid
and mussel zone Lepsiella:
boxplot(EGGS~ZONE, ward)
with(ward, rbind(MEAN=tapply(EGGS, ZONE, mean),
VAR=tapply(EGGS,ZONE,var)))
with(data, expr, ...)
is a generic function that evaluates an expression in an environment
constructed from data, possibly modifying the original data.
tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE)
Applies a function to each cell of a ragged array, that is to each
(non-empty) group of values given by a unique combination of the
levels of certain factors.
Conclusions 1
there is no evidence of non-normality (boxplots not grossly asymmetrical) or
of unequal variance (boxplots very similar in size and variances very similar).
Hence the simple Student's t-test is likely to be reliable. We now test the null hypothesis
as formulated above:

t.test(EGGS~ZONE, ward, var.equal=T)

Conclusions 2
reject the null hypothesis (i.e. egg production is not the same). Egg production
was significantly greater in mussel zone than in littorinid zone.

Student, 1908. The Probable Error of a Mean. Biometrika 6, 1-25, stable


URL: http://www.jstor.org/stable/2331554.

6.2 Welch's t-test


The separate variances t-test (Welch's test) represents an improvement of the t-test in that
it more appropriately accommodates samples with modestly unequal variances.

The dataset - example 2 (Logan page 142, 6B)


Furness and Bryant (1996) measured the metabolic rates of eight male and six female
breeding northern fulmars (a seabird species) and were interested in testing the null
hypothesis (H0) that there was no difference in metabolic rate between the sexes. (Furness,
R.W., Bryant, D.M., 1996. Effect of Wind on Field Metabolic Rates of Breeding Northern
Fulmars. Ecology 77, 1181-1188, doi: 10.2307/2265587.)

Exercise 23:
a) import the furness data set (furness.csv, separator=, )
b) We then assess the assumptions of normality and homogeneity of variance for the null
hypothesis that the population mean metabolic rate is the same for both male and
female fulmars.
boxplot(METRATE~SEX, furness)
with(furness, rbind(MEAN=tapply(METRATE, SEX, mean),
VAR=tapply(METRATE, SEX,var)))

Conclusions 1
Whilst there is no evidence of non-normality (boxplots not grossly
asymmetrical), the variances are a little unequal (one of the boxplots is not more
than three times smaller than the other). Hence, a separate variances t-test
(Welch's test) is more appropriate than a pooled variances t-test (Student's test).

We perform a t-test to test the null hypothesis as described above


t.test(METRATE~SEX, furness, var.equal=F)
Conclusions 2
do not reject the null hypothesis, i.e. metabolic rate of male fulmars was not found
to differ significantly from that of females.

6.3 F-Test
(Chapter based on Trauth, 2006)
The F distribution was named after the statistician Sir Ronald Fisher (1890-1962). It is used for
hypothesis testing, namely for the F-test. The F distribution has a relatively complex probability
density function.

The F-test by Snedecor and Cochran (1989) compares the variances sa² and sb² of two distributions,
where sa² > sb². An example is the comparison of the natural heterogeneity of two samples based on
replicated measurements. The sample sizes na and nb should be above 30. Then, the proper test
statistic to compare variances is
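(simply the ratio of the two sample variances, with the larger one in the numerator)

$$ F = \frac{s_a^{2}}{s_b^{2}}, \qquad s_a^{2} > s_b^{2} $$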

The two variances are not significantly different, i.e., we reject the alternative hypothesis, if the
measured F-value is lower than the critical F-value, which depends on the degrees of freedom
νa = na − 1 and νb = nb − 1, respectively, and the significance level α.

Function in R: var.test(y ~ x, data = dataset)

Exercise 24: a) perform an F-test for the Ward data set (EGGS, ZONE)
b) perform an F-test for the Furness data set (METRATE, SEX)
c) create the following artificial data set and perform an F-test:
x <- rnorm(50, mean = 0, sd = 1)
y <- rnorm(50, mean = 1, sd = 1)
var.test(x, y)
now vary the standard deviation (sd) and the number of data
points and describe what you see.

6.4 χ²-test (goodness-of-fit test)

As Trauth (2006) explains it:

The χ²-test introduced by Karl Pearson (1900) involves the comparison of distributions, permitting
a test that two distributions were derived from the same population. This test is independent of the
distribution that is being used. Therefore, it can be applied to test the hypothesis that the
observations were drawn from a specific theoretical distribution. Let us assume that we have a data
set that consists of 100 chemical measurements from a sandstone unit. We could use the χ²-test to
test the hypothesis that these measurements can be described by a Gaussian distribution with a typical
central value and a random dispersion around it. The n data are grouped in K classes, where n should
be above 30. The frequencies within the classes Ok should not be lower than four and never be zero.
Then, the proper statistic is
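(in standard notation, with Ok the observed and Ek the expected frequency in class k)

$$ \chi^{2} = \sum_{k=1}^{K}\frac{(O_k - E_k)^{2}}{E_k} $$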
where Ek are the frequencies expected from the theoretical distribution. The alternative hypothesis
is that the two distributions are different. This can be rejected if the measured χ² is lower than the
critical χ², which depends on the degrees of freedom ν = K − Z, where K is the number of classes and
Z is the number of parameters describing the theoretical distribution plus the number of variables
(for instance, Z = 2 + 1 for the mean and the variance for a Gaussian distribution of a data set of one
variable, Z = 1 + 1 for a Poisson distribution of one variable).

as Logan (2011) explains it:

By comparing any given sample chi-square statistic to its appropriate χ² distribution, the
probability that the observed category frequencies could have been collected from a population with a
specific ratio of frequencies (for example 3:1) can be estimated. As is the case for most hypothesis
tests, probabilities lower than 0.05 (5%) are considered unlikely and suggest that the sample is
unlikely to have come from a population characterized by the null hypothesis. Chi-squared tests are
typically one-tailed tests focusing on the right-hand tail as we are primarily interested in the
probability of obtaining large chi-square values. Nevertheless, it is also possible to focus on the left-
hand tail so as to investigate whether the observed values are "too good to be true".

The χ² distribution takes into account the expected natural variability in a population as well as the
nature of sampling (in which multiple samples should yield slightly different results). The more
categories there are, the more likely it is that the observed and expected values will differ. It could be
argued that when there are a large number of categories, samples in which all the observed
frequencies are very close to the expected frequencies are a little suspicious and may represent
dishonesty on the part of the researcher.

Example 3 (Logan page 477, 16A)

Zar (1999) presented a data set that depicted the classification of 250 plants into one of four
categories on the basis of seed type (yellow smooth, yellow wrinkled, green smooth, and green
wrinkled). Zar used these data to test the null hypothesis that the samples came from a population
that had a 9:3:3:1 ratio of these seed types.

First, we create a data frame with the Zar (1999) seed data

COUNT <- c(152,39,53,6)


TYPE <-c("YellowSmooth", "YellowWrinkled", "GreenSmooth", "GreenWrinkled")
seeds <- data.frame(TYPE,COUNT)

We should convert the seeds data frame into a table. Whilst this step is not strictly necessary, it
ensures that columns in various tabular outputs have meaningful names:

seeds.xtab <- xtabs(COUNT~TYPE,seeds)

We assess the assumption of sufficient sample size (<20% of expected values <5) for the specific null
hypothesis.

chisq.test(seeds.xtab, p=c(9/16,3/16,3/16,1/16), correct=F)$exp

Conclusion 1: all expected values are greater than 5, therefore the chi-squared statistic is likely to
be a reliable approximation of the χ² distribution.

Now, we test the null hypothesis that the samples could have come from a population with a 9:3:3:1
seed type ratio.

chisq.test(seeds.xtab,p=c(9/16,3/16,3/16,1/16), correct=F)
Conclusion 2: reject the null hypothesis, because the probability is lower than 0.05. The samples are
unlikely to have come from a population with a 9:3:3:1 ratio.

Exercise 25: a) import the example2 data set (example2.txt)
b) describe what information is given in the data set
c) investigate the variability of the different elemental proxies
depending on the different phases by using boxplots
d) report the respective values for mean and variance for each
element and each phase
e) choose an appropriate t-test and test all elements for the null
hypothesis that the element content is the same in the Glacial and
in the Holocene sediments. Interpret the results
f) perform an F-test for Ca and Sr to test the null hypothesis that
each has the same variability in the Glacial and in the Holocene
sediments respectively.
g) calculate the Rb-Sr-ratio and add it as a new column to the
example2 data set. Plot the Rb-Sr-ratio against depth. If high ratios
indicate wet climate and low ratios dry climate, what can you read
out of the dataset?
7 Multiple and curvilinear regression
7.1 Multiple linear regression
(Chapter based on Logan 2011, chapter 9)
Multiple regression is an extension of simple linear regression whereby a response variable is
modeled against a linear combination of two or more simultaneously measured continuous
predictor variables. There are two main purposes of multiple linear regression:

1. to develop a better predictive model (equation) than is possible from models based on
single independent variables
2. to investigate the relative individual effects of each of the multiple independent variables
above and beyond the effects of the other variables.

Example- Scatterplot Matrix:

import or load sample data set example2.txt

library(car)
scatterplotMatrix(~Ca+Ti+K+Rb+Sr+Mn+Fe, data=example2, diag="boxplot")
(In older versions of the car package this function was called scatterplot.matrix.)

Conclusion 1: the element Mn is obviously non-normally distributed (asymmetrical boxplot). Let us
try out how a scale transformation (e.g. a logarithm) changes that:

scatterplotMatrix(~Ca+Ti+K+Rb+Sr+log10(Mn)+Fe, data=example2,
diag="boxplot")

Conclusion 2: the log10 transformation appears successful, there is no evidence of non-normality
(symmetrical boxplots).

7.2 Cu

It has become apparent from our previous analysis that a linear regression model provides a good
way of describing the scaling properties of the data. However, we may wish to check whether the
data could be equally well described by a polynomial fit of a higher degree.

Example- Polynomial regression: (Logan 9F)

Sokal and Rohlf (1997) present an unpublished data set in which the nature of the relationship
between the Lap94 allele (a gene variant) frequency in Mytilus edulis (blue mussel) and distance (in
miles) from Southport was investigated.

We import the mytilus data set using the import function of Rcmdr (mytilus.csv, separator=,)

Sokal and Rohlf (1997) transformed frequencies using angular transformations (arcsin
transformations). Hence, we also have to transform the Lap94 data using

asin(sqrt(LAP))*180/pi
We then have to show that a simple linear regression does not adequately describe the relationship
between Lap94 and distance by examining a scatterplot and a residual plot.

scatterplot
scatterplot(asin(sqrt(LAP))*180/pi ~DIST, data=mytilus)

residual plot:
plot(lm(asin(sqrt(LAP))*180/pi ~DIST, data=mytilus), which=1)

Conclusion 1 the scatterplot smoother suggests a potentially non-linear relationship and a


persisting pattern in the residuals further suggests that the linear model is inadequate for
explaining the response variable (Lap94).

We try to fit a polynomial regression (additive multiple regression) model incorporating up to the
fifth power (5th order polynomial)

Note that trends beyond a third order polynomial are unlikely to have much
biological basis and are likely to be over-fit. This is also true for most
geoscientific applications.

mytilus.lm5 <- lm(asin(sqrt(LAP))*180/pi ~ DIST + I(DIST^2) + I(DIST^3) +
                  I(DIST^4) + I(DIST^5), mytilus)
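As a side note, the same model can be written more compactly with poly(); with raw = TRUE the fit is identical to the formulation above (without it, poly() uses orthogonal polynomials whose individual coefficients are not directly comparable):
mytilus.lm5b <- lm(asin(sqrt(LAP))*180/pi ~ poly(DIST, degree=5, raw=TRUE), data=mytilus)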

We check the output:

typing mytilus.lm5 gives the output:

Coefficients:
(Intercept)        DIST   I(DIST^2)   I(DIST^3)   I(DIST^4)   I(DIST^5)
  2.224e+01   1.049e+00  -1.517e-01   6.556e-03  -1.033e-04   5.518e-07

examining the diagnostics by typing

plot(mytilus.lm5, which=1)

Conclusion 2: there is no wedge pattern in the residuals (see figure XX in chapter 8.3.1), suggesting
homogeneity of variance and that the fitted model is appropriate.

Now, we want to examine the fit of the model with respect to the contribution of the different
powers:

anova(mytilus.lm5)

Analysis of Variance Table

Response: asin(sqrt(LAP)) * 180/pi


Df Sum Sq Mean Sq F value Pr(>F)
DIST 1 1418.37 1418.37 125.5532 2.346e-07 ***
I(DIST^2) 1 57.28 57.28 5.0701 0.04575 *
I(DIST^3) 1 85.11 85.11 7.5336 0.01907 *
I(DIST^4) 1 11.85 11.85 1.0493 0.32767
I(DIST^5) 1 15.99 15.99 1.4158 0.25915
Residuals 11 124.27 11.30
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What was already visible above is here put into numbers: powers of distance beyond
a cubic (third order, x^3) do not make significant contributions to explaining the variation of this data
set.

For evaluating the contribution of an additional power (order) we can compare the fit of higher
order models against models one lower in order.

Comparing second order against first order:

mytilus.lm1<- lm(asin(sqrt(LAP))*180/pi ~DIST, mytilus)


mytilus.lm2<- lm(asin(sqrt(LAP))*180/pi ~DIST+I(DIST^2), mytilus)
anova(mytilus.lm2, mytilus.lm1)

adding a model for third order:


mytilus.lm3<- lm(asin(sqrt(LAP))*180/pi ~DIST+I(DIST^2)+I(DIST^3),
mytilus)

Comparing second order against third order:


anova(mytilus.lm3, mytilus.lm2)

Conclusion 3: the third order model (lm3) fits the data significantly better than the second order
model (lm2) (P=0.018), while the second order model is not really better than the linear model (lm1)
(P=0.087).

Hence, we focus on the third order model and estimate the model parameters from the summary:

summary (mytilus.lm3)

Call:
lm(formula = asin(sqrt(LAP)) * 180/pi ~ DIST + I(DIST^2) + I(DIST^3),
data = mytilus)

Residuals:
Min 1Q Median 3Q Max
-6.1661 -2.1360 -0.3908 1.9016 6.0079

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.2232524 3.4126910 7.684 3.47e-06 ***
DIST -0.9440845 0.4220118 -2.237 0.04343 *
I(DIST^2) 0.0421452 0.0138001 3.054 0.00923 **
I(DIST^3) -0.0003502 0.0001299 -2.697 0.01830 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.421 on 13 degrees of freedom


Multiple R-squared: 0.9112, Adjusted R-squared: 0.8907
F-statistic: 44.46 on 3 and 13 DF, p-value: 4.268e-07

Coefficient of determination:
R² is equal to the square of the correlation coefficient only in simple
linear regression.
R² = SS_reg / SS_tot reflects the explained variance.

Conclusion 4: there is a significant cubic (third order) relationship between the frequency of the
Lap94 allele and the distance from Southport. The final equation of the regression is:

asin(sqrt(LAP))*180/pi = 26.2233 - 0.9441*DIST + 0.0421*DIST^2 - 0.00035*DIST^3

We can now construct a summary figure:

plot(asin(sqrt(LAP))*180/pi ~ DIST, data=mytilus, pch=16, axes=F,
    xlab="", ylab="")
axis(1, cex.axis=.8)
mtext(text=expression(paste("Miles east of Southport, Connecticut")),
side=1, line=3)
axis(2, las=1)
mtext(text=expression(paste("Arcsin ",sqrt(paste("freq. of allele ",
italic("Lap"))^{94}))), side=2, line=3)
x<-seq(0,80,l=1000)
points(x,predict(mytilus.lm3, data.frame(DIST=x)), type="l")
box(bty="l")
Example 8 (adapted from Trauth, 2006)

The data set consists of synthetic data resembling the barium content (in wt.%) down a sediment
core (depth in meters).

Exercise 26:
a) import the bariumcont data set (bariumcont.txt)
b) investigate what type of regression fits the relation between
barium content and sediment depth best by using appropriate
plotting tools and assessing at least two different polynomial
models.
d) extract the equation of the best regression curve from the
summary of the respective model and plot the data set together
with the regression curve (500 points, x-max=20)
Table taken from Logan (2011), Table 9.1
8 Cluster analysis
Cluster Analysis is a technique for grouping a set of individuals or objects into previously
unknown groups (called clusters). The task is to assign these objects to clusters in a way that
the objects in the same cluster are more similar (in some sense or another) to each other
than to those in other clusters. In biology, cluster analysis has been used for decades in the
area of taxonomy where organisms are classified into arbitrary groups based upon their
characteristics. Such characteristics can be binary (i.e. feature present or absent), numerical
(quantity of a feature) or factorial (e.g. color of a feature). We as humans cluster things
intuitively based on rules derived from experience; a computer, however, needs more precise
rules, namely:
1. a measure of distance between objects or clusters
2. a method to determine this distance between clusters
3. an algorithm for clustering
We will now go through these requirements and use R to cluster different datasets.
Some terms:
Objects/observations/samples or whatever you want to call them are
described by certain properties and these properties have values. In
mathematical terms we can see the observation as a vector and the
values of its properties as components (or coordinates) of that vector.
We may use the following mathematical notation:
Object = (value of property 1, value of property 2, ..., value of property n)
or, in short: x = (x1, x2, ..., xn).
Vectors are elements of a vector space. In the simplest case an
observation therefore can be seen as a point in a n dimensional
coordinate system, where n is the number of components of that vector.
Our set of observations then can be arranged in a matrix. The
observations are usually in the rows and the variables in the columns.

( x  )   ( x1  x2  ..  xn )
( y  ) = ( y1  y2  ..  yn )
( z  )   ( z1  z2  ..  zn )
( .. )   ( ..  ..  ..  .. )

8.1 Measures of distance


Before we can attempt to group objects based on how similar they are, we have to somehow
measure the proximity or distance of one object relative to another. There are several
ways to determine distances mathematically, depending amongst others on the type of data
at hand. Some of these measures are outlined in the following.
If we have binary qualitative data, say, two objects that are characterized by the presence
or absence of certain features we can use the Jaccard distance to determine how dissimilar
these two objects are. In Figure 27 we can see two objects A and B. Some features (points in
the figure) are present only in A, some only in B and some in both.

So the Jaccard distance in the above example would be d_Jaccard(A, B) = (2 + 5) / 10 = 7/10, meaning
that 70% of the features occur in only one of the two objects.
In R we can use the dist() function
DISTANCEMATRIX=dist(DATASET, method="")
to compute the distances between several objects. dist() calculates the distance between
the rows of a matrix, so make sure your DATASET has the right format. The set of
comparison results you get back is called distance matrix. method="binary" gives you the
binary (Jaccard) distance. In R the input vectors are regarded as binary bits, so all non-zero
elements are on and zero elements are off. In these terms the distance can be seen as the
proportion of bits in which only one is on amongst those in which at least one is on, which
is an equivalent definition to the one given above.
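
A minimal sketch of this (the presence/absence values are made up for illustration):

# two objects described by the presence (1) or absence (0) of five features
presence <- rbind(A = c(1, 1, 0, 1, 0),
                  B = c(1, 0, 1, 1, 1))
dist(presence, method="binary")   # 3 of the 5 features present in at least one object occur in only one of them -> 0.6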
Another important dissimilarity measure often used in ecology is the Bray-Curtis dissimilarity. It
compares species counts at two sites by summing up the absolute differences between the
counts for each species at the two sites and dividing this by the sum of the total abundances
in the two samples. The general formula for calculating the Bray-Curtis dissimilarity
between samples A and B is as follows, supposing that the counts for species x are denoted
by nAx and nBx:
d_BrayCurtis(A, B) = sum_{x=1..m} |nAx - nBx| / sum_{x=1..m} (nAx + nBx)
This measure takes on values between 0 (samples identical: nAx = nBx for all x) and 1 (samples
completely disjoint).
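
A short sketch with vegdist() from the vegan package (the counts are made up for illustration):

library(vegan)
# hypothetical counts of four species at two sites
counts <- rbind(site1 = c(5, 0, 3, 2),
                site2 = c(2, 1, 0, 4))
vegdist(counts, method="bray")   # (3+1+3+2) / (7+1+3+6) = 9/17 = 0.53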
Exercise 27: You want to compare how similar two aquariums are. Calculate the
Jaccard distance and the Bray-Curtis dissimilarity with the formulas given above
and the data given below.

Number of individuals in:
Aquarium 1:   3   2   4   6
Aquarium 2:   6   0   0   11

If we have quantitative data the two most common distances used are Euclidean distance
and Manhattan (city-block) distance. Let us look at an example: We have analyzed two
different rocks for their content of calcium and silicium and we find the following:

R1 = (calcium, silicium) = (r11, r12) = (0, 1) a.u. and R2 = (calcium, silicium) = (r21, r22) = (2, 2) a.u.
(a.u. = arbitrary units).

If we plot calcium against silicium (Figure 28) we can see two points which represent the
two different rocks. How different are they? A very intuitive way to think of the distance
between these two points is the direct connection between them (continuous line). This
distance is called the euclidean distance and can easily be calculated with the Pythagorean
theorem: d(R1, R2) = sqrt((r11 - r21)^2 + (r12 - r22)^2) = sqrt(5), as you know from school. Another way
to calculate the distance is to follow the dotted lines, as if we walked around in
Manhattan. Then we get the Manhattan (city-block) distance:
d(R1, R2) = |r11 - r21| + |r12 - r22| = 3. Obviously, the obtained distances are not the same; the
distance between two objects very much depends on how you measure it.

Euclidean:  d(x, y) = sqrt( sum_{k=1..n} (x_k - y_k)^2 ).  If n=2 we have the case we saw in the example before.
Manhattan:  d(x, y) = sum_{k=1..n} |x_k - y_k|.  If n=2 we have the case we saw in the example before.

In the dist() function, method="euclidean" and method="manhattan" respectively
give you the aforementioned distances.
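
A quick check of the rock example with dist() (coordinates as reconstructed above):

# the two rocks, coordinates (calcium, silicium) in a.u.
rocks <- rbind(R1 = c(0, 1),
               R2 = c(2, 2))
dist(rocks, method="euclidean")   # sqrt(2^2 + 1^2) = sqrt(5) = 2.236
dist(rocks, method="manhattan")   # 2 + 1 = 3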

These distance measures depend on the scale. If two components are
measured on different scales you should consider some standardization
first, so that the components contribute equally to the distance. Otherwise
the larger components will always have more influence.

In ecology the euclidean distance has to be used with care. The difference
between none (0) and 1 individuals occurring is mathematically smaller than
the difference between 1 and 4 individuals occurring, but ecologically it can
be a huge difference. In ecology, for example when comparing species
composition at different sites, you would therefore rather use other
distance measures like the Bray-Curtis dissimilarity introduced above.

Exercise 28: Calculate the Manhattan and euclidean distance between the two
objects a = (1, 1, 2, 3) and b = (2, 2, 1, 0). One way to solve this is to
use R as a normal calculator applying the formulas above (or do it in your
head). The second one is to create two vectors a=c(a1,a2,a3,a4) and b and
use rbind(Vector1, Vector2) to combine them to a matrix. Then you can use
the dist() function.

Further reading about distance measures:


http://www.econ.upf.edu/~michael/stanford/maeb4.pdf
explanation of euclidean distance
http://www.econ.upf.edu/~michael/stanford/maeb5.pdf
explanation of non-euclidean distance

8.2 Agglomerative hierarchical clustering


One possible way of clustering is called agglomerative hierarchical clustering. It starts out
with each observation being a one-item cluster by itself. Then the clusters are merged until
only one large cluster remains which contains all the observations. At each stage the two
nearest clusters are combined to form one larger cluster. There are different methods to
determine which clusters are the nearest, they are called linkage methods.

8.2.1 Linkage methods


Have a look at Figure 29. Method A considers the (euclidean) distance between the nearest
two objects of two clusters as the distance between these clusters. This is called
single-linkage (or nearest-neighbor). Analogously, method B is called complete-linkage (or
farthest-neighbor) and takes the maximum distance between objects in different clusters as
the distance. This method produces more compact clusters than single-linkage. In
practice you would often use more sophisticated methods, which lead to better results. You
can take the average distance between all pairs of objects within the two clusters (called
average linkage) or you can use Ward linkage, which leads to rather homogeneous groups. With
all of these linkage methods you can use any of the distance measures introduced before.

A simple algorithm for agglomerative hierarchical clustering could look like this:
1. Initially all objects you want to cluster are alone in their cluster
2. Calculate the distances between all clusters using your linkage method
3. Join the closest two clusters
4. Go back to step 2 until all objects are in one comprehensive cluster
8.2.3 Clustering in R
In R we can use the function hclust(DISTMATRIX, method="") of the stats package. With
DISTMATRIX being a distance matrix obtained by the dist() function and method="" being
one of the following: "single", "complete" or "ward". Further linkage methods exist but
are now of no concern for us, you can look them up under ?hclust.
Let's get started:
Import the dataset PMM.txt into R. The variables are measured in grossly different ranges.
This might result in faulty clustering (try it if you want). Therefore we want to standardize
the range of the data first. We can use decostand() of the package vegan:
library(vegan)
PMMn=decostand(PMM,"range")
"range" sets the highest value of each variable to 1 and scales the others accordingly, so
the values are now a fraction of the maximum.
Have a look at the dataframe. We are now interested in clustering the elements and not the
observations. The objects we want to cluster have to be in the rows of the dataframe.
Therefore we need to transpose our datamatrix (change rows and columns). This is done by
PMMnt=t(PMMn)
Now let us calculate the distances between the elements. Since these are measured on a
ratio scale it makes sense to use euclidean distances.
distm=dist(PMMnt, method="euclidean")
Now we use this distance matrix as input for our clustering. Let us first use single as
clustering method.
ahclust=hclust(distm, method="single")
A graphical output can easily be obtained by plot(ahclust). This gives you a dendrogram
in which we can see how closely the observations are related. The length of the branches shows
the similarity of the objects. You can see that a lot of elements are added to existing
clusters in a stepwise fashion, i.e. one after the other. This is a peculiarity of the
single-linkage method.
If you have a lot of objects, presenting the result as a dendrogram is not very readable anymore.
It is more useful to know the assignment of each object to a certain cluster for a certain
number of clusters present. For that we use the R function cutree(DATA,#CLUSTERS). The
result is a vector with as many components as we have objects and each component tells
you the cluster for that object. If we want to divide our data into two groups we therefore
can now use
ahclust_2g=cutree(ahclust, k=2)
to get the assignment of each object to one of these groups. You can type the variable name
ahclust_2g to get some idea about this assignment.
Exercise 29: Try out the other linkage methods. Are there significant
differences? Check by plotting all the dendrograms into one big graph (see
Chapter 4.4 Combined figures if you forgot how to do that).
BONUSPOINTS: Cluster the non-standardized dataset PMM. Do the results
make sense?

Another function for hierarchical clustering with interesting possibilities is
agnes(), which can be found in the package cluster. You may want to
check ?agnes for details.

8.3 Kmeans clustering


Another approach to clustering which is gaining importance is kmeans clustering. The basic
idea is to assign objects randomly to clusters and then reorder this assignment until you
find the best solution.
The procedure is an easy way to classify a given data set into a certain number k of
clusters which you determine beforehand. In the beginning you define k centroids in an n-
dimensional coordinate system, one for each cluster. The next step is to take each point
belonging to a given data set and associate it to its nearest centroid. After completion of this
first step we calculate k new centroids as barycenters of the clusters resulting from the
previous step. After we have these k new centroids, a new assignment of the data points to
the now nearest centroid follows. This is repeated and the k centroids change their location
step by step until no more changes occur.
An algorithm for kmeans clustering might be
1. Choose k points in the space which is represented by the objects that are being
clustered. These points represent initial cluster centroids.
2. Calculate the distance between the centroids and the objects.
3. Assign each object to the group with the centroid they are closest to.
4. When all objects have been assigned, recalculate the positions of the k centroids by
averaging the vectors of all objects assigned to it.
5. If there has been a change repeat Steps 2, 3 and 4 until the centroids no longer move,
else you are finished.

The basic function in R is kmeans(DATA,#CLUSTERS) of the stats package.


Let us use the same dataset as above. Then
km=kmeans(PMMnt,4)
will sort our objects into 4 different clusters. You do not need to calculate a distance matrix
beforehand because the distances are calculated anew in each computation round. The obtained
data structure is different than before. The vector which contains the assignment of the
objects to the different clusters can be obtained by
kmclust=km$cluster
kmclust
plot(kmclust)
will give you a representation of this clustering method. Understandably, we will not get a
dendrogram this time.
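
A few more components of the kmeans result are worth a look; a short sketch (assuming km and
ahclust from above):

km$centers     # coordinates of the 4 cluster centroids
km$size        # number of objects in each cluster
# cross-tabulate the kmeans groups against a 4-group hclust solution
table(kmeans=km$cluster, hclust=cutree(ahclust, k=4))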
Well, that was easy. Almost as easy as the following exercises.

8.4 Chapter exercises


Exercise 30: Forests sites characterized by coverage (in %) of different plants
shall be classified. Cluster the dataset forests.csv with the kmeans and the
hclust method.
When importing the data make sure to set row.names=1. Example command:
forests<-read.csv("forests.csv", header=TRUE, row.names=1)
Check the data first. Is some transformation beforehand necessary? Is the
dataset in the right format for the dist() function?
You may use any linkage procedure, use 4 classes in the kmeans clustering.
Create appropriate representations of your results.
BONUSPOINTS: Are the results of hclust and kmeans clustering on the level of
4 clusters the same? Compare the results of both algorithms qualitatively by
1. creating a table which shows the assignment in both cases. cbind()
might come in handy.
2. plotting both result vectors in the same plot. You can use lines(DATA,
type="p", col="red") to add the second plot to the first.
Would you say clustering is an objective statistical method?

Exercise 31: The districts of the Baltic Sea can be grouped by the composition of their
algae species (the different sites are shown on a map). Cluster the sites in the dataset
algae_presence.csv with agglomerative hierarchical clustering and a linkage method of your choice.
Use the presence/absence of species for classification. What distance
measure should you use? Look at the dendrogram. Do the results make
sense?
Repeat the exercise after applying a Beals transformation (see the following
infobox or ?beals) to the data. What distance measure should you use now? Do
your results make more sense?

Beals transformation:
Beals smoothing is a multivariate transformation specially designed for
species presence/absence community data containing noise and/or a lot
of zeros. This transformation replaces the observed values (i.e. 0 or 1)
of the target species by predictions of occurrence on the basis of its
co-occurrences with the other remaining species (values between 0 and
1). In many applications, the transformed values are used as input for
multivariate analyses.
In R Beals transformation can be performed with the beals() function of the
vegan package.

8.5 Problems of cluster analysis
Cluster analysis cannot be regarded as an objective statistical method because:
- the choice of the similarity index is made by the user,
- each linkage procedure gives different results,
- the number of groups is chosen by the researcher.

Further reading:
Afifi, May, Clark (2012): Practical Multivariate Analysis, CRC Press. Chapter
16: Cluster Analysis
A good introduction which is easy to understand but more in-depth than
this script.

http://www.econ.upf.edu/~michael/stanford/maeb7.pdf
Explanation of hierarchical clustering with examples.

bio.umontreal.ca/legendre/reprints/DeCaceres_&_Legendre_2008.pdf
A discussion about Beals transformation
8.6 R code library for cluster analysis
Function: library(stats)
    Use: stats contains many basic statistical tools.

Function: library(vegan)
    Use: vegan contains specific tools for ecologists.

Function: dist(x, method="")
    Arguments: x: a numeric matrix, data frame or "dist" object.
        method: the distance measure to be used; must be "euclidean", "maximum",
        "manhattan", "canberra", "binary" or "minkowski".
    Use: calculates the distance between the rows of a matrix and returns a distance matrix.

Function: decostand(x, method)
    Arguments: x: community data in a matrix.
        method: the standardization method, e.g. "normalize", "standardize", "range".
        See ?decostand for details.
    Use: standardization.

Function: t(x)
    Use: transposes a matrix.

Function: hclust(d, method="")
    Arguments: d: a dissimilarity structure (distance matrix) as produced by dist().
        method: the agglomeration method to be used; one of "ward", "single",
        "complete", "average", or others.
    Use: agglomerative hierarchical clustering.

Function: cutree(tree, k=, h=)
    Arguments: tree: a tree as produced by hclust().
        k: desired number of groups.
        h: height where the tree should be cut.
        At least one of k or h must be specified; k overrides h if both are given.
    Use: cuts a tree created by hierarchical clustering at a certain height or number of clusters.

Function: kmeans(x, centers)
    Arguments: x: your input; has to be a numeric matrix of data.
        centers: the number of clusters, say k.
    Use: kmeans clustering.

Function: beals(x)
    Arguments: x: input community data frame or matrix.
        Further parameters and details can be looked up with ?beals.
    Use: performs a Beals transformation of the data.
9 Ordination
One of the most challenging aspects of multivariate data analysis is the sheer complexity of
the information. If you have a dataset with 100 variables, how do you make sense of all the
interrelationships present? The goal of ordination methods is to simplify complex datasets
by reducing the number of dimensions of the objects. Recall, that in the cluster analysis part
we defined objects with features as vectors with components. These objects can be thought
of as points in a n-dimensional space with the values of the respective components giving
you the coordinates on n different coordinate axes. Up until n=3 this is easily conceivable,
but it works in exactly the same way for n>3.
The easiest way of dimension reduction would be to only consider one variable, e.g. the first
component of each vector and discard the rest for your analysis. This is of course not very
reasonable because you will lose a lot of information. Therefore different ordination
techniques have been developed that minimize the distortion of such a dimension
reduction. We will focus on two of these methods: Principal Component Analysis (PCA) and
non-metric multidimensional scaling (NMDS).

9.1 Principal Component Analysis (PCA)


Principal Component Analysis (PCA) is a powerful tool when you have many variables and
you want to look into what these variables can explain. The basic idea behind PCA is that the
variables in the observations are correlated amongst each other, so the full dataset contains
redundant information. PCA is useful to reduce the number of variables, because with PCA
we can look for "supervariables" that sum up the information of several variables without
losing much of the information contained in the original data.
Furthermore, we can also use PCA for finding structures and relationships in the data, for
example outliers.
More mathematically, PCA uses an orthogonal linear transformation to transform your
data of possibly correlated variables to a new coordinate system defined by a set of linearly
uncorrelated variables, called principal components. So it finds linear projections of your
data which preserve the information your data have.

9.1.1 The principle of PCA explained


A possible, graphic way to describe the general procedure of PCA is outlined in the following
and for the easy case of having only two variables the procedure is illustrated in Figure 32:
1. First we standardize our variables in a way so that they have a mean of zero.
If we think of our variables as coordinates we move the origin of our coordinate
system to the mean of all variables.
2. Similar to creating a regression line we now look for our first Principal Component
(PC) by finding a combination (called a linear combination) of the original components
in such a way that the new PC variable accounts for the most variance in our data.
If we think of it graphically, each variable represents a coordinate axis. We now find
a coordinate axis in a way that it points into the main direction of the data spread.
This new axis is a combination of our original axes.
3. We repeat step 2. The next principal component is again a linear combination that
accounts for the most variance in the original variables, under the additional
constraint that it has to be orthogonal that means uncorrelated to all the other
PCs we already calculated.
If we think of it graphically, we create a further coordinate axis but it has to be
orthogonal to all the others we already created (in the example case of only having 2
variables, there is now only one possible option). We can repeat this step and find as
many PCs as we have original variables.
As a result of the PCA we get a new coordinate system with the axes represented by the PCs,
as seen in Figure 33. We get as many PCs as we had original variables. These PCs are
combinations of the original variables. In this new reference frame, note that the variance is
greater along axis 1 than along axis 2. Secondly, note that the spatial relationships of the
points are unchanged; the process has merely rotated the data, no information was lost.
Finally, note that our new vectors, or axes, are uncorrelated.
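
A tiny illustration of this rotation idea with two simulated, correlated variables (the data are
made up; only the principle matters):

set.seed(1)
x <- rnorm(100)
y <- 0.8*x + rnorm(100, sd=0.3)
toy_pca <- princomp(cbind(x, y))
toy_pca$loadings   # PC1 points along the main direction of the spread
summary(toy_pca)   # most of the variance is carried by the first component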

We need to introduce two more definitions that are used in discussing the results of a PCA.
The first is component scores, sometimes called factor scores. Scores are the transformed
variable values corresponding to a particular data point i.e. its new coordinates. Loading is
the weight by which each original variable is multiplied to get the component score. The
loadings tell you about the contribution of each original variable to a PC, so a high loading
means the variable determines the PC to a large extent.
The exact mathematical reasoning and procedure of PCA shall be of no concern for us here.
We want to focus more on the application and interpretation of the results, so let's rather
get started in R.
Further reading:
http://yatani.jp/HCIstats/PCA
a simple explanation of PCA which also explains how to interpret the
results.
http://strata.uga.edu/software/pdf/pcaTutorial.pdf
well comprehensible, more advanced description
http://ordination.okstate.edu/overview.htm
PCA and other ordination techniques for ecologists.
9.1.2 PCA in R
There are several possibilities to perform a PCA in R. We use a basic function from the stats
package: princomp(DATASET, cor=TRUE). cor specifies if the PCA should use the covariance
matrix or a correlation matrix. As a rough rule, we use the correlation matrix if the scales of
the variables are unequal. This is a conscious choice of the researcher!
An alternative to princomp() is the command principal() from the psych
package.
Let us work again with a dataset we already know and love: PMM.txt. Load the dataset, and
then we can use
PMM_pca <- princomp(PMM, cor=TRUE)
to carry out a complete PCA and get 15 principal components, their loadings and the scores
of the data. The first step of a PCA would be to calculate a covariance or correlation matrix.
However, the function will calculate it for us and we can use our raw data as input.
A basic summary of our analysis can be obtained by
print(PMM_pca)
summary(PMM_pca)
To get an idea about the data it is common to plot the scores of the 1st PC against the scores
of the 2nd. You could simply call plot(PMM_pca$scores[,1:2]), but a nicer output can be achieved by:
plot(PMM_pca$scores[,c(1,2)], pch=20)
text(PMM_pca$scores[,1],PMM_pca$scores[,2])
abline(0,0); abline(v=0)
To get an overview, we can create a scatterplot matrix, for example like this:
pairs(PMM_pca$scores[,1:4], main="Scatterplot Matrix of the scores of
the first 4 PCs")
We will get a scatterplot matrix of all these components against each other.
Since we want to know which variables have the greatest influence on our data, we want to
have a look at the loadings of the PCs. One way to do this is to just type the variable name:
PMM_pca$loadings
A graphical representation can be obtained by:
barplot(PMM_pca$loadings[,1], ylim=c(-0.5,0.5),ylab="Loading",
xlab="Original Variables", main="Loadings for PC1")
which shows which elements have the highest influence on the first PC.
A very common display for PCA results with scores as points in a coordinate system (e.g.
first and second component) and the loadings as vectors in the same graph is called a biplot.
The biplot is easily obtained:
biplot(PMM_pca, choices=1:2); abline(0,0); abline(v=0)
choices selects the PCs to plot. It is quite useful in analysing and interpreting the results.
Finally, we can have a look at the scree plot, plotting the eigenvalues or explained variance against the PC
number:
plot(PMM_pca, type="lines")
We see that the most information is in the first component and from the 6th onward there
is not much information in the component anymore. From the summary we know that the
first 6 components explain 90% of the variance. In the next part we will see methods to
determine which PCs are still useful for further analysis.

Exercise 32: Repeat the plots above, but this time looking at the
relationship between the 1st and the 3rd Principal Component.

9.1.2.1 Selecting the number of components to extract


So, how many Principal Components are useful for further analysis? As often, there is no
absolute truth, but there are a couple of methods that generally lead to good results. The
most common approaches are based on the eigenvalues. As we know, the first PC is
associated with the largest eigenvalue, the second PC with the second-largest eigenvalue,
and so on. One criterion suggests retaining all components with eigenvalues greater than 1
because components with eigenvalues less than 1 explain less variance than contained in a
single original variable. Another possibility is a scree test. The scree plot will typically
demonstrate a bend or elbow, and all the components above plus one below this sharp
break are retained.
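
A short sketch of both criteria for the princomp result from above (the eigenvalues are the
squared standard deviations of the components):

eigenvalues <- PMM_pca$sdev^2
eigenvalues
sum(eigenvalues > 1)                # Kaiser criterion: components with eigenvalue > 1
screeplot(PMM_pca, type="lines")    # look for the elbow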
Exercise 33: Do a scree plot of the PMM dataset. How many components
would you retain based on the two criteria explained above?

9.1.3 PCA exercises


Exercise 34 (based on an example in: Cook, Swayne: Interactive and Dynamic
Graphics for Data Analysis): In Australia, rock crabs of the genus
Leptograpsus occur. One species,
L. variegatus, has split into two new species, previously grouped by color:
orange or blue.
Variable Explanation
species orange or blue
sex male or female
group 1-4: orange&male, blue&male, etc.
frontal lip (FL) length, in mm
rear width (RW) width, in mm
carapace length (CL) length of midline of the carapace, in mm
carapace width (CW) maximum width of carapace, in mm
body depth (BD) depth of the body; for females, measured after displacement of the
abdomen, in mm

The main question is whether we can determine the species and
sex of the crabs based on these five morphological variables. In addition, we
would like to have one single variable instead of
five. Work through the following
subtasks:
1. View your data. From univariate box-plots assess whether any individual
variable is sufficient for discriminating the species or sexes.
Possibility 1, old school: get your plots (10!) clearly arranged in one graph and use
plot(a,b). Possibility 2, fast: melt the data set, then
ggplot + facet_grid().
2. How would you test for a significant difference?
Test the variables for differences between the
groups.
BONUSPOINTS: Create a scatterplot matrix of all measured variables
against each other and use a
color code to show which group each observation belongs
to. Alternative: GGally::ggpairs().
Does this help us in answering the main question?
3. Perform a PCA on the dataset. Since the variables do not all vary on
the same scale we use the PCA with the correlation matrix.
4. Plot the scores of the first against the second PC. Are the groups
distinguishable?
Create a plot of the scores of the first against the third PC.
Create a scatterplot matrix of the scores of the first few PCs. Which
ones discriminate between the groups?
Can we determine species and sex of the crabs with the scores of a single
PC?
Hint:
col=australian.crabs$group
5. Look at the loadings of the first PC. What do you conclude from
that?
Create a plot of the loadings. Which of the original variables
have the highest influence on the first PC?
Create a biplot for PC1 and PC2. Does it
give an idea about the structure of the data?
6. How many PCs would you keep for further analysis?
7. BONUS: Your colleague measured another crab with the
variables FL=0.91, RW=0.62, CL=0.81, CW=0.86, BD=0.90, but forgot to
write down which group it belongs to. Can you tell?
Hint: predict()
8. BONUS: Perform a cluster analysis on the
same data set.

9.1.4 Problems of PCA
Principal Component Analysis is not suited for all data. The main problem is that it assumes
a linear correlation between the observations and their variables. This is often not justified,
especially in ecology. Species, for example, often show a unimodal response to
environmental factors. Other ordination techniques exist which might be more suitable.
One alternative is to use a higher order, polynomial PCA. Another possibility are
techniques that are summarized under the term multidimensional scaling (MDS), which will be
covered in the next part of this chapter.

9.2 Multidimensional scaling (MDS)


Multidimensional scaling (MDS) is a set of related ordination techniques often used for
exploring underlying dimensions that explain the similarities and distances between a set
of measured objects. Similar to PCA these methods are based on calculating distance
between objects. In PCA many axes are calculated from the data, but only a few are viewed,
owing to graphical limitations. In MDS, a small number of axes are explicitly chosen prior to
the analysis and the data are fitted to those dimensions. So basically the distances of the
data are directly projected into a new coordinate system. Unlike other ordination methods,
for example PCA, which assumes linear relationships in the data, MDS makes few
assumptions about the nature of the data, so it is very robust, more flexible and well suited for a wide
variety of data. MDS also allows the use of any distance measure between the
samples, unlike other methods which require particular measures, such as covariance or
correlation in PCA. So MDS procedures solve most of the problems of PCA and are gaining in
significance also in environmental and ecological application. In community ecology non-
metric multidimensional scaling (NMDS) is a very common method, which we will discuss
in the following.
Before you go on, you might find it helpful to review the
distance measures in chapter 8.1.

9.2.1 Principle of NMDS
You start out with a matrix of data consisting of n rows of samples and p columns of
variables, such as taxa for ecological data. From this, a n x n distance matrix of all pairwise
distances among samples is calculated with an appropriate distance measure, such as
Euclidean distance, Manhattan distance or, most common in ecology, Bray-Curtis distance.
The NMDS ordination will be performed on this distance matrix. In NMDS, only the rank
order of entries in the distance matrix (not the actual dissimilarities) is assumed to contain
the significant information. Thus, the purpose of the non-metric MDS algorithm is to find a
configuration of points whose distances reflect as closely as possible the rank order of the
original data, meaning that the two objects farthest apart in the original data should also be
farthest apart after NMDS, and so on.
Next, a desired number of m dimensions is chosen for the ordination. The MDS algorithm
begins by assigning an initial location to each item of the samples in these m dimensions.
This initial configuration can be entirely random, though the chances of reaching the
correct solution are enhanced if the configuration is derived from another ordination
method. Since the final ordination is partly dependent on this initial configuration, a
program performs several ordinations, each starting from a different random arrangement
of points, and then selects the ordination with the best fit, or applies other procedures in
order to avoid the problem of local minima.
Distances among samples in this starting configuration are calculated and then regressed
against (compared with) the original distance matrix. In a perfect ordination, all
ordinated distances would fall exactly on the regression, that is, they would match the rank
order of distances in the original distance matrix perfectly. The goodness of fit of the
regression is measured based on the sum of squared differences between ordination-based
distances and the distances predicted by the regression. This goodness of fit is called stress.
It can be seen as the mismatch between the rank order of distances in the data, and the rank
order of distances in the ordination. The lower your stress value is, the better is your
ordination.
The configuration is then improved by moving the positions of samples in ordination space
by a small amount in the direction in which stress decreases most rapidly. The ordination
distance matrix is recalculated, the regression performed again, and stress recalculated.
This entire procedure of nudging samples and recalculating stress is repeated until the
stress value seems to have reached a (perhaps local) minimum.
Further reading:
http://strata.uga.edu/software/pdf/mdsTutorial.pdf
Excellent description of the method and the application, the basis of the
text in chapter 9.2.4.
http://www.unesco.org/webworld/idams/advg
Presentation of MDS at a more technical
level.
http://ordination.okstate.edu/overview.htm
NMDS and other ordination techniques for ecologists.

9.2.2 NMDS in R
The function we want to use in R is called metaMDS of the package vegan. In order to
perform NMDS we first need to calculate the distance between items. metaMDS is a smart
function and will take on this task for you as well using vegdist. However, if you want to
scale your data and calculate the distance using a different function, metaMDS also accepts
a distance matrix as an input.
The vegan package is written for ecologists and metaMDS applies
transformations suited for community data (e.g. species counts)
by default. For non-ecological data you should check these settings
(in particular autotransform) before running the
ordination.

One alternative is the isoMDS()
function in the MASS package.
So our work in R is rather easy. We load our forest dataset by (watch out for the correct
directory path!)
forests<-read.csv("forests.csv", header=TRUE, row.names=1)
Since metaMDS is a complex functions there are a lot of possible parameters. You will want
to check
?metaMDS
to see what possible parameters there are. The columns of the dataset should contain the
variables and the rows the samples. In our dataset this is the other way around, so we still
need to transpose it:
t_forests=t(forests)
Now a simple NMDS analysis of our dataset with the default settings could look like this:
def_nmds_for=metaMDS(t_forests)
We might wish to specify some parameters:
nmds_for=metaMDS(t_forests, distance = "euclidean", k = 3,
autotransform=FALSE)
distance is the distance measure used (see 8.1 Measures of distance), k is the number of
dimensions, and autotransform specifies whether automatic transformations are turned on or off.
You can see which objects the metaMDS function returns by
names(nmds_for)
the important ones are
nmds_for$points #sample/site scores
nmds_for$species #scores of variables (species / taxa in ecology)
nmds_for$stress #stress value of final solution
nmds_for$dims #number of MDS axes or dimensions
nmds_for$data #what was ordinated, including any transformations
nmds_for$distance #distance metric used
We can view which parameters were used by typing the name of the output variable:
nmds_for
Important for us are the sample and variable scores, which we can extract by
variableScores <- nmds_for$species
sampleScores <- nmds_for$points
The column numbers correspond to the MDS axes, so this will return as many columns as
were specified with the k parameter in the call to metaMDS.
We can obtain a plot by:
plot(nmds_for)
Sites/samples are shown by black circles, the taxa by red crosses.
MDS plots can be customized by selecting either "sites" or "species" in display=, by
displaying labels instead of symbols by specifying type="t" and by choosing the dimensions
you want to display in choices=.
plot(nmds_for, display = "species", type = "t", choices = c(2, 3))

You can even build up the plot step by step:
type="n" gives an empty plot frame, to which you
can then add
points and labels yourself. Overlapping labels can be avoided by using a
larger window and by using cex to reduce the size of symbols and text.
An example:
plot(nmds_for, type="n")
plot the sites as points:
points(nmds_for, display=c("sites"), choices=c(1,2), pch=3, col="red")
plot the species labels
(smaller text, position
adjusted with pos):
text(nmds_for, display=c("species"), choices=c(1,2), pos=1, col="blue", cex=0.7)
The plot shows dimensions (choices) 1
& 2. Typing the commands with other values for choices
plots other combinations of dimensions.

9.2.3 NMDS Exercises


Exercise 35: Ordinate the forests dataset with NMDS
(Exercise 30). Try different
parameters.
Exercise 36: Ordinate the algae dataset (Exercise 31) with NMDS after
a Beals transformation.

9.2.4 Considerations and problems


(Based on a tutorial by S. Holland:
http://strata.uga.edu/software/pdf/mdsTutorial.pdf)
The ordination will be sensitive to the number of dimensions that is chosen, so this choice
must be made with care. Choosing too few dimensions will force multiple axes of variation
to be expressed on a single ordination dimension. Choosing too many dimensions is no
better in that it can cause a single source of variation to be expressed on more than one
dimension. One way to choose an appropriate number of dimensions is to perform ordinations
with progressively higher numbers of dimensions. A scree diagram (stress versus number of
dimensions) can then be plotted, on which one can identify the point beyond which
additional dimensions do not substantially lower the stress value. A second criterion for the
appropriate number of dimensions is the interpretability of the ordination, that is, whether
the results make sense.
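
A short sketch of such a stress "scree" plot (using t_forests and the parameters from the example
above):

library(vegan)
stress_k <- sapply(1:4, function(k)
    metaMDS(t_forests, distance="euclidean", k=k,
            autotransform=FALSE, trace=FALSE)$stress)
plot(1:4, stress_k, type="b", xlab="Number of dimensions", ylab="Stress")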
The stress value reflects how well the ordination summarizes the observed distances among
the samples. Several rules of thumb for stress have been proposed, but they have been
criticized for being over-simplistic. Stress increases both with the number of samples and
with the number of variables. For the same underlying data structure, a larger data set will
necessarily result in a higher stress value, so use caution when comparing stress among
data sets. Stress can also be highly influenced by one or a few poorly fitted samples, so it is
important to check the contributions to stress among samples in an ordination.
Although MDS seeks to preserve the distance relationships among the samples, it is still
necessary to perform any transformations to obtain a meaningful ordination. For example,
in ecological data, samples should be standardized by sample size to avoid ordinations that
reflect primarily sample size, which is generally not of interest.
A real disadvantage of NMDS is the use of ranks: if a PCA already fits the data well, the
NMDS results will be worse in comparison.
9.3 R code library for ordination

Function: library(vegan)
    Use: vegan contains specific tools for ecologists.

Function: princomp(x, cor=)
    Arguments: x: dataset.
        cor: use the correlation matrix (TRUE/FALSE).
    Use: Principal Component Analysis.

Function: screeplot(x, type=, main="")
    Arguments: x: result of princomp().
        type: "barplot" or "lines".
        main: title of the graph.
    Use: creates a scree plot.

Function: biplot(x, choices=)
    Arguments: x: result of princomp().
        choices: a vector with two components, specifies which PCs to use in the biplot.
    Use: creates a so-called biplot.

Function: metaMDS(x, distance="", k=, autotransform=)
    Arguments: x: dataset or distance matrix.
        distance: the distance measure that should be used, e.g. "manhattan", "euclidean", "bray", "jaccard".
        k: number of dimensions.
        autotransform: (TRUE/FALSE) performs automatic transformations suited for community ecology. Default is TRUE!
    Use: performs non-metric multidimensional scaling.
10 Spatial data
The analysis of spatial data is a very hot topic in the R community. Things change very fast
and it is very difficult to get an overview of the different ends of methods and packages. If
you are interested in the subject, we would recommend to subscribe to the R-SIG-Geo
mailing list and read the book of Bivand et al. 2008.
http://www.spatialanalysisonline.com/output/html/R-
Projectspatialstatisticssoftwarepackages.html
http://cran.r-project.org/web/views/Spatial.html
Bivand, Roger S., Pebesma, Edzer J., Gómez-Rubio, Virgilio, 2008: Applied
Spatial Data Analysis with R, Series: Use R!, Springer, 2008, XIV, 378 p., ISBN
978-0-387-78170-9 - Available for students of CAU Kiel as a free ebook
https://stat.ethz.ch/mailman/listinfo/R-SIG-Geo/
R Special Interest Group on using Geographical data and Mapping

Because of the limited time available for this subject we will focus on the practical aspects
of spatial analysis, i.e. things you might need if you add maps to your statistical project or
final thesis. This includes mainly import of vector and raster maps, plotting of maps and
statistical analyses.
First, we need to define the different types of spatial data:
- Point data, e.g. a location and one or more properties, like the location of a tree and
its diameter. Normally this type is considered the simplest case of a vector file, but
we treat it separately, because mapping in ecology frequently means going out with
a GPS and writing down (or recording) the position and some properties (e.g. species
composition, occurrence of animals, diameter of trees, ...).
- Vector data with different subtypes, like a road or river map (normally coming
from a vector GIS like ArcGIS).
- Grid or raster data are files with a regular grid, like digital images from a camera,
a digital elevation model (DEM) or the results of global models.

10.1 First example


First, install the libraries spatstat and gpclib
library (spatstat)
To get an overview about the main functions you can use
demo(spatstat)
A first example shows us the location of trees
data(swedishpines)
X <- swedishpines
str(X)
plot(X)
summary(X)
plot(density(X, 10))

10.2 Point Data

Example from http://help.nceas.ucsb.edu/R:_Spatial

Point data are possibly the most frequent application for ecologists. Typically, positions are
recorded with a GPS device and then listed in Excel or even as plain text.
The procedure in R to convert point data to an internal map object or an ESRI map is straightforward:
- read in the data
- define the columns containing the coordinates
- convert everything to a point shapefile
The following is a brief R script that reads such records from a CSV file, converts them to the
appropriate R internal data format, and writes the location records as an ESRI Shape File.
The file Lakes.csv contains the following columns: 1: LAKE_ID, 2: LAKENAME, 3:
Longitude, 4: Latitude. For compatibility with ArcMap GIS, Longitude must appear
before Latitude.
library(sp)
library(maptools)
LakePoints = read.csv("Lakes.csv")
Columns 3 and 4 contain the geographical coordinates.
LakePointsSPDF =
SpatialPointsDataFrame(LakePoints[,3:4],data.frame(LakePoints[,1:4]))
plot(LakePointsSPDF)
Now write a shape-file for ESRI Software.
maptools:::write.pointShape(coordinates(LakePointsSPDF),
    data.frame(LakePointsSPDF), "LakePointsShapeRev")
writeSpatialShape(LakePointsSPDF,"LakePointsShapeRev2")

10.2.1 Bubble plots
A quite useful chart type is the spatial bubble plot: the size of the bubble is proportional to
the value of the variable.
library(sp)
library(lattice)
#data(meuse)
#coordinates(meuse) = ~x+y
## bubble plots for cadmium and zinc
data(meuse)
coordinates(meuse) <- c("x", "y") # promote to SpatialPointsDataFrame
bubble(meuse, "cadmium", maxsize = 1.5, main = "cadmium concentrations
(ppm)", key.entries = 2^(-1:4))
bubble(meuse, "zinc", maxsize = 1.5, main = "zinc concentrations
(ppm)", key.entries = 100 * 2^(0:4))

10.3 Raster data


Raster data are quite common in ecology. They can come as a digital elevation model (DEM)
or a satellite image. R offers a full range of functions, therefore you should first load the libraries
library(raster)
library(rgdal)
The easiest way to import a grid is to use the gdal library, but we then have to convert the
result manually to the raster format.
lu87grd = readGDAL("lu87.asc")
Check type and structure of the variable
str(lu87grd)
lu87=raster(lu87grd)
str(lu87)
The structure of raster is different.
spplot(lu87)
demgrd = readGDAL("dem.asc")
dem=raster(demgrd)
spplot(dem)
First, let us check the frequency distribution of elevation
hist(dem)
lu07grd = readGDAL("lu07.asc")
lu07=raster(lu07grd)
spplot(lu07grd)
An analysis of different land use frequencies is similar
hist(lu07)

To show you how maps are used for statistics we want to find out the land use type on steep
slopes.
slopegrd = readGDAL("slope.asc")
slope=raster(slopegrd)
spplot(slope)
hist(slope)
Extract all cells with slope >4
steep = slope>4
Multiply with the land use map: multiplication with 0 gives 0, for 1 the value of the land use is taken.
lu_steep = steep * lu87
Finally, count the different classes
freq(lu_steep)
Exercise 37: Calculate the land use at
elevations > 1000 m (code for forest: ...). Is the
forest area increasing between 1987 and 2007?
Hints: use the same approach as for the steep slopes above.

10.4 Vector Data


Vector data are normally handled with a GIS. A very common software product is ArcGIS,
but this package is also very expensive and due to the copy protection very complicated to
install and use. Therefore, some open source packages are worth a try.

Some open source or free GIS packages


http://grass.fbk.eu/ GRASS, available for many operating systems,
the oldest and biggest system
http://www.qgis.org/ QGIS, for Windows, Linux and MacOS X
http://52north.org/communities/ilwis the ILWIS GIS, similar
to ArcView 3.2
The following tutorial is taken from: Paul Galpern, 2011: Workshop 2: Spatial
Analysis Introduction, R for Landscape Ecology Workshop Series, Fall 2011,
NRI, University of Manitoba (http://nricaribou.cc.umanitoba.ca/R/)

Unfortunately, R is not very suitable for vector data, therefore we suggest that you prepare
the vector files as far as possible with a real GIS. If you really want to take a close look at
vector maps in R you can read the following help files and the book by Bivand et al. 2008.

Adrian Baddeley, 2011: Handling shapefiles in the spatstat package,


http://cran.r-project.org/

First, load the required libraries


library(rgeos)
library(maptools)
library(raster)
Read the shape file in a R map. Shape-files, i.e. files with an extension .shp are vector files
in an ArcView format which can be used by all GIS packages.
vecBuildings <- readShapeSpatial("patchmap_buildings.shp")
vecRoads <- readShapeSpatial("patchmap_roads.shp")
vecRivers <- readShapeSpatial("patchmap_rivers.shp")
vecLandcover <- readShapeSpatial("patchmap_landcover.shp")
str(vecRivers)

Next, plot the four maps


plot(vecLandcover, col="grey90", border="grey40", lwd=2)
plot(vecRoads, col="firebrick1", lwd=3, add=TRUE)
plot(vecRivers, col="deepskyblue2", lwd=10, add=TRUE)
plot(vecBuildings, cex=2, pch=22, add=TRUE)

Because R is not good at handling vector maps we convert everything to a raster. First, we define the size
and extent of the new raster map
rasTemplate <- raster(ncol=110, nrow=110, crs=as.character(NA))
extent(rasTemplate) <- extent(vecLandcover)
The final conversion is
rasLandcover <- rasterize(vecLandcover, rasTemplate, field="GRIDCODE")
The field="GRIDCODE" part defines the variable which contains the code for the land use.
rasBuildings <- rasterize(vecBuildings, rasTemplate)
rasRoads <- rasterize(vecRoads, rasTemplate)
rasRivers <- rasterize(vecRivers, rasTemplate)
Final, control the result with a plot
plot(rasLandcover)
plot(rasBuildings)
plot(rasRoads)
plot(rasRivers)

A simple application of map operations is e.g. the creation of a buffer zone around streets or
buildings. This can be done with the boundaries() function (formerly called edge()), which draws a line
around the edges of a raster:
ras2 <- boundaries(rasRoads, type="outer")
but you can check with
plot(ras2)
that only the edges are drawn. To add one map to the other we use
rasRoads2 <- cover(rasRoads, ras2)
You can also join the commands above into one:
ras2=raster(rasBuildings, layer=2)
ras3 = boundaries(ras2, type="outer")
# wrong data type
rasBuildings <- cover(rasBuildings, ras3)
rasBuildings <- cover(rasBuildings, boundaries(rasBuildings, type="outer"))
rasRoads <- cover(rasRoads, boundaries(rasRoads, type="outer"))

The final step is to combine the buildings, roads, rivers, and landcover rasters into one. We
will cover the landcover raster with the other three.
Examining the rasBuildings plot, you will notice that the features are assigned a value of 1
and the background a value of 0. In order to cover one raster over another, we
need to set these 0 values to NA. On a raster, NA implies that a cell is transparent. So let's do
this for all the covering rasters:
rasBuildings[rasBuildings==0] <- NA
rasRoads[rasRoads==0] <- NA
rasRivers[rasRivers==0] <- NA
The features on each of these three rasters have a value of 1. In order to differentiate these
features on the final raster we need to give each feature a different value. Recall that our
landcover classes are 0 to 4. Let's set rivers to 5, buildings to 6, and roads to 7. It seems to
be standard practice to use a continuous set of integers when creating feature classes on
rasters.
rasRivers[rasRivers==1] <- 5
rasBuildings[rasBuildings==1] <- 6
rasRoads[rasRoads==1] <- 7
And now we can combine these using the cover function, with the raster on top first, and
the raster on the bottom last in the list:
patchmap <- cover(rasBuildings, rasRoads, rasRivers, rasLandcover)

You can now plot the map


plot(patchmap, axes = TRUE, xlab = "Easting (km)", ylab =
"Northing (km)", col = c(terrain.colors(5), "blue", "black", "red"))

or export it to an ArcGIS format


writeRaster(patchmap, filename="myPatchmap.asc", format="ascii")

10.5 Working with your own maps


If you want to work with your own maps, it is always a good idea to check the maps first
library(rgeos)
library(maptools)
To read the landuse map from the GIS-course you can first check the file
getinfo.shape("landuse.shp")
and then read it in and plot it.
myLanduse <- readShapeSpatial("landuse.shp")
plot(myLanduse)
Writing back a shapefile is also easy
writePolyShape(myLanduse,"testshape.shp")

The attributes of a map are stored in the so called slots. You get a list with
slotNames(myLanduse)
The slot we are interested in is data
str(myLanduse@data)
where you find all attributes of the map. You can manipulate these variables as usual, e.g.
myLanduse@data[1,]
If you want to manipulate or select data you can
attach(myLanduse@data)
myLanduse@data[GRIDCODE==1,1]
You could also use the rgdal library
library(rgdal)
myLand2 <- readOGR(dsn="landuse.shp", layer="landuse")

To plot the landuse (and any other attribute) you can


spplot(myLanduse, "GRIDCODE")

or convert it first to a raster map:


LUTemplate <- raster(ncol=110, nrow=110, crs=as.character(NA))
extent(LUTemplate) <- extent(myLanduse)
The final conversion is
LUraster <- rasterize(myLanduse, LUTemplate, field="GRIDCODE")
11 Time series analysis
http://cran.r-project.org/web/views/TimeSeries.html - The current
task view of time series analysis in R
http://www.statsoft.com/textbook/sttimser.html: Tutorial on Time-
Series Analysis
http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc4.htm
Chair of Statistics, 2011: A First Course on Time Series Analysis (Open
Source Book),
http://statistik.mathematik.uni-wuerzburg.de/timeseries/

11.1 Definitions
Time Series: In statistics and signal processing, a time series is a sequence of data points,
measured typically at successive times, spaced at (often uniform) time intervals. Time series
analysis comprises methods that attempt to understand such time series, often either to
understand the underlying theory of the data points (where did they come from? what
generated them?), or to make forecasts (predictions). Time series prediction is the use of a model
to predict future events based on known past events: to predict future data points before they
are measured. The standard example is the opening price of a share of stock based on its past
performance.
Trend: In statistics, a trend is a long-term movement in time series data after other
components have been accounted for.
Amplitude: The amplitude is a non-negative scalar measure of a wave's magnitude of
oscillation
Frequency: Frequency is the measurement of the number of times that a repeated event
occurs per unit of time. It is also defined as the rate of change of phase of a sinusoidal
waveform. (Measured in Hz) Frequency has an inverse relationship to the concept of
wavelength.
Autocorrelation is a mathematical tool used frequently in signal processing for analysing
functions or series of values, such as time domain signals. Informally, it is a measure
of how well a signal matches a time-shifted version of itself, as a function of the
amount of time shift (the lag). More precisely, it is the cross-correlation of a signal
with itself. Autocorrelation is useful for finding repeating patterns in a signal, such as
determining the presence of a periodic signal which has been buried under noise, or
identifying the missing fundamental frequency in a signal implied by its harmonic
frequencies (a small example with acf() follows below).
Period: time period or cycle duration is the reciprocal value of frequency: T = 1/frequency

All citations from the corresponding keywords at www.wikipedia.org 2006
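
A tiny illustration of autocorrelation with the acf() function from the stats package (the noisy
sine wave is simulated, only for illustration):

# noisy sine wave with a period of 2*pi, sampled in steps of 0.1
tt <- seq(0, 10*pi, by=0.1)
signal <- sin(tt) + rnorm(length(tt), sd=0.3)
acf(signal, lag.max=100)   # peaks at lags of about 63 steps (= one period of 2*pi/0.1)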


11.2 Dat
The data set for this part of the course is erle_stat_engl.csv, it contains the following
columns:

Name Content
Date Date
Peff Effective precipitation (mm)
Evpo_Edry Evaporation from dry alder carr (mm)
T_air Air temperature (°C)
Sunshine Sunshine duration (h)
Humid_rel Relative Humidity (%)
H_GW Groundwater level (m)
H_ERLdry Water level in dry part of alder carr (m)
H_ERLwet Water level in wet part of alder carr (m)
H_lake Water level in Lake Belau (m)
Infiltra Infiltration into the soil (mm)

11.3 Date handling

11.3.1 Conversion of variables to TS


First, read file into a data-frame in R:
t <- read.csv("erle_stat_engl.csv")
The following command converts the text of a German date (31.12.2013) to an internal date variable:
t$date <- as.Date(as.character(t$Date), format="%d.%m.%Y")
Another common format is 2013-12-31, which would be converted with:
t$date <- as.Date(as.character(t$Date), format="%Y-%m-%d")
The conversion with as.character is sometimes necessary because date values from files are sometimes read in as factor variables.
It is often useful to convert dates into a standard format available on many platforms, the POSIX format, which counts seconds from 1970.
t$posix <- as.POSIXct(t$date)
POSIXct stores the value as a single number (seconds since 1970) and is the form better suited for data frames, whereas the alternative class POSIXlt stores the broken-down, readable components (year, month, day, ...).
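A short illustration of the difference for a single date (a sketch):
d <- as.POSIXct("2013-12-31", tz = "UTC")
unclass(d)                   # one number: seconds since 1970-01-01
unclass(as.POSIXlt(d))       # a list of components (sec, min, hour, mday, mon, year, ...)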
An easier way is to use the following function from the package Hmisc, which converts text from a file directly into a date variable:
library(Hmisc)
t3 <- csv.get("erle_stat_engl.csv", datevars="Date", dateformat="%d.%m.%Y")
More information is available in the documentation of the packages chron and zoo, where zoo is useful for time series with unequally spaced observations. Some important methods are:
DateTimeClasses(base) Date-Time Classes
as.POSIXct(base) Date-time Conversion Functions
cut.POSIXt(base) Convert a Date or Date-Time Object to a
Factor
format.Date(base) Date Conversion Functions to and from
Character
round.POSIXt(base) Round / Truncate Date-Time Objects
axis.POSIXct(graphics) Date and Date-time Plotting Functions
hist.POSIXt(graphics) Histogram of a Date or Date-Time Object

11.3.2 Creating factors from dates


Factors play an important role in the classification of a data set. They correspond roughly to the horizontal and vertical headers of a pivot table in Excel. Frequently used factor variables are e.g. names of species in biology or years and months in time series.
Some factor variables are already identified automatically when R reads a file. Unfortunately, R also treats date variables in text form (1.1.2006) as factor variables. The conversion of these variables is explained in chapter 11.3.1. To create factors from time series we mainly use the following functions:
cut() and factor()
cut.POSIXt(base)
The creation of a factor variable with years is easy:
t$years = cut.Date(t$date, "years")
creates a factor variable containing the years of the data set. The extraction of months and weeks works in the same way (see help(cut.Date) for a summary of all possibilities).
You can check the results with
levels(t$years)
You can now use the factor to classify your data set in many functions. With the
commands
plot(t$years, t$H_GW)
boxplot(t$H_GW ~ t$years)
you get separate plots for each year.
The creation of a monthly variable is done by:
t$months = cut.Date(t$date, "months")
The command creates a factor for each month, e.g. "Jan 1978", "Feb 1978" etc. Frequently
this is not what you want. If you need a mean value for all months in the data set (e.g. for
seasonal analysis) you have to extract the name or number of the months and then convert them into a factor which can be used for a boxplot etc.:
mon_tmp <- format.Date(t$date,format="%m")
t$months <- factor(mon_tmp)
year_tmp <- format.Date(t$date,format="%Y")
t$years <- factor(year_tmp)
boxplot (t$T_air ~ t$months)

Another variable we use frequently is the day number (day of year, 1 to 365):


library(timeDate)
t$julianday=timeDate::dayOfYear(timeDate(t$date))
boxplot (t$T_air ~ t$julianday)
qplot (data=t, y=T_air, x=julianday,geom="jitter",col=years)

Once the factors are defined, you can use the following function to create all kinds of summaries (sums, means, ...):
aggregate(t$T_air, list(n = t$months), mean)
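If you need one value per month and year instead, you can pass both factors (a sketch):
aggregate(t$T_air, list(month = t$months, year = t$years), mean)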

With the dplyr library, the calculation of annual summaries is quite straightforward:
t_annual <- dplyr::group_by(t, years)
t_ann_mean <- dplyr::summarise(t_annual,
                mean_t = mean(T_air),
                sum_prec = sum(Peff))
qplot(as.numeric(as.character(years)), mean_t, data=t_ann_mean, geom="line")
qplot(as.numeric(as.character(years)), sum_prec, data=t_ann_mean, geom="line")

The annual summaries can also be reshaped into long format and plotted as one faceted figure:


t_ann = reshape2::melt(data=t_ann_mean,id.vars=c("years"))
ggplot(data=t_ann,aes(x=as.numeric(as.character(years)),y=value))+
geom_line()+
facet_grid(variable ~ .,scales="free")
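The same annual summary can also be written as a single dplyr pipeline (a sketch; the %>% operator is loaded with dplyr):
library(dplyr)
t_ann_mean <- t %>%
  group_by(years) %>%
  summarise(mean_t = mean(T_air), sum_prec = sum(Peff))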

38: create boxplots of the lake and groundwater levels grouped by months and by years
39: calculate the monthly mean of the lake water level
40: create a scatterplot of groundwater and lake water levels (hint: use the date conversions)

11.4 Statistical Analysis of TS

11.4.1 Structure of a TS
In statistics, TS are composed of the following subcomponents:
Y_t = T_t + S_i + R_t
where
T = Trend, a monotone function of time t
S = one or more seasonal component(s) (cycles) of length/duration i
R = Residuals, the unexplained rest
The analysis of TS is entirely based on this concept. The first step is usually to detect and
eliminate trends. In the following steps, the cyclic components are analysed. Sometimes,
the known seasonal influence is also removed.
11.4.2 Trend Analysis
Normally, trend analysis is a linear or non-linear regression analysis with time as x-axis or
independent variable. Many authors also use the term for the various filtering algorithms that are used to make plots of the data look smoother.

11.4.2.1 Regression Trends


For the analysis of trends you can use the regression methods from Excel or R. Some
packages offer detrended TS, which are the residuals of a regression analysis. The basic
procedure is:
compute the regression equation (linear or non-linear)
Y_t = b_0 + b_1 * x_t
compute the detrended residuals
Y*_t = Y_t - (b_0 + b_1 * x_t)
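A minimal sketch of this two-step procedure (using the lake water level as an example series; any other numeric column works the same way):
y   <- t$H_lake                 # example series
x   <- seq_along(y)             # time index
fit <- lm(y ~ x)                # linear trend b0 + b1*x
detrended <- residuals(fit)     # Y*_t = Y_t - (b0 + b1*x)
plot(detrended, type = "l")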

41: use a linear model to remove the trend from the air temperature
(Hint: function lm, look at the contents of the results)

11.4.2.2 Filter
Some TS show a high degree of variation and the real information may be hidden in this variation. This is why there are several methods of filtering or smoothing a data set. The process is sometimes also called low-pass filtering, because it removes the high pitches from a sound file and lets the low frequencies pass. The most frequently used methods are splines and moving averages. Moving averages are computed as mean values of a number of records before and after the current value; the width of the averaging window determines the smoothness of the curve. If filtering is used to remove trends, 'detrended' means the deviations from the moving average.
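A minimal sketch of a centred moving average with the base function filter() (the window width of 31 days is an arbitrary choice):
k  <- 31                                              # window width in days
ma <- stats::filter(t$H_lake, rep(1/k, k), sides = 2) # centred moving average
plot(t$H_lake, type = "l", col = "grey")
lines(ma, col = "red", lwd = 2)                       # smoothed curve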

11.5 Removing seasonal influences


Seasonal influences are known effects that are reasonably stable in terms of annual timing,
direction, and magnitude. Possible causes include natural factors (the weather),
administrative measures (starting and ending dates of the school year), and social, cultural
or religious traditions (fixed holidays such as Christmas).
You can remove this influence as follows (a short sketch is given below):
calculate the mean values for the seasonal components (e.g. monthly mean values in a data set with monthly values)
subtract these mean values from the original data set
Please keep in mind that the removal of seasonal influences is a very complicated process; there are many possibilities and methods.
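A minimal sketch using the monthly factor from chapter 11.3.2 (only one of many possible approaches):
monthly_mean <- ave(t$T_air, t$months, FUN = function(x) mean(x, na.rm = TRUE))
t$T_air_deseason <- t$T_air - monthly_mean    # subtract the seasonal component
boxplot(t$T_air_deseason ~ t$months)          # the seasonal signal should be gone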
42: Remove the seasonal trend from the air temperature (Hint: use
daynumbers)

11.6 Irregular time series


In ecology, time series often have an irregular spacing. There are several packages which
can be used to produce a regularly spaced data set.
help(zoo)
help (its)

Lundholm, M., 2011: Introduction to R's time series facilities,
http://people.su.se/~lundh/reproduce/introduction_ts.pdf
contains a section about the interpolation of irregular data.

11.7 TS in R
First, we have to define the data set as a time series.
attach(t)
lake = ts(H_lake, start=c(1989,1),freq=365)
Next we can already plot an overview of the analysis
ts = stl(lake,s.window="periodic")
plot(ts)
or look at the text summary:
summary(ts)
A look at the structure of the results
str(ts)
reveals that you can extract the detrended and deseasonalized remainder with
clean_ts = ts$time.series[,3]
for further analysis. Please take a look at the help-page of the procedure to understand
what happens below the surface.

For time series analysis we often need so-called lag variables, i.e. the data set shifted back or forth by a number of time steps. A typical example is the unit hydrograph, which compares the current discharge to the effective precipitation of a number of past days. This number is called the lag. You can create the corresponding time series with the lag function:
ts_test = as.ts(t$H_GW) # Groundwater
lagtest <- ts_test # temp var
for (i in 1:4) {lagtest <- cbind(lagtest,lag(ts_test,-i))}
Now check the structure and the content of lagtest.
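To inspect the result you can, for example (a sketch; the column names of the matrix are generated automatically by cbind):
head(lagtest)                                  # original series plus lags 1-4
cor(lagtest, use = "pairwise.complete.obs")    # correlation of the series with its own lags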
43: analyse groundwater water level (detrend, remove seasonal
trends)

11.7.1 Auto- and Crosscorrelation


We continue to use our data set with the water level in the lake Belau.
The function to calculate autocorrelation is acf, you can check the help-file for syntax and
parameters with
help(acf)
A simple function call without further parameters uses the default maximum lag of 10*log10(N) time steps:
erle_acf <- acf(H_ERLdry)
erle_acf
The following, more complex command is better adapted to our data set; it calculates the autocorrelation for a whole year (365 days) and plots the coefficients.
erle_acf <- acf(H_ERLdry, lag.max=365, plot=TRUE)
The cross-correlation analysis is very similar. To analyse the relation between water level and precipitation:
erle_ccf <- ccf(H_ERLdry, Peff, lag.max=30, plot=TRUE)
By splitting the output screen into several windows you can get a concise overview of the relations between the different variables:
split.screen(c(2,2))
ccf(H_ERLdry, Peff, lag.max=30, plot=TRUE)
screen(2)
ccf(H_ERLdry, Evpo_Edry, lag.max=30, plot=TRUE)
screen(3)
ccf(H_ERLdry, H_lake, lag.max=30, plot=TRUE)
screen(4)
ccf(H_ERLdry, Infiltra, lag.max=30, plot=TRUE)
close.screen(all = TRUE)

44: analyse the influence of water level in the wet part and the lake
(H_ERLdry) and groundwater level (H_GW)
45: analyse the autocorrelation of different nutrients from the
wqual.data (see page 117 for a description)

11.7.2 Fourier- or spectral analysis


One of the most frequent problems in time series analysis is the detection and identification
of the cycles or periods in a data set. In real life this is similar to the analysis of the different
notes in a sound file.
The steps for a spectral analysis are1:

1 The original code was created by Earl F. Glynn <efg@stowers-institute.org>


air = read.csv("air_temp.csv")
TempAirC <- air$T_air
Time <- as.Date(air$Date, "%d.%m.%Y")
N <- length(Time)

oldpar <- par(mfrow=c(4,1))


plot(TempAirC ~ Time)

# Using fft (fast Fourier Transform)


transform <- fft(TempAirC)
# Extract DC component from transform
# (modulus of the first element divided by N = mean of the series)
dc <- Mod(transform[1])/N

# for help see help(spec.pgram)


periodogram <- round( Mod(transform)^2/N, 3)

# Drop first element, which is the mean


periodogram <- periodogram[-1]

# keep first half up to Nyquist limit


# The Nyquist frequency is half the sampling frequency
periodogram <- periodogram[1:(N/2)]

# Approximate number of data points in single cycle:


print( N / which(max(periodogram) == periodogram) )

# plot spectrum against Fourier Frequency index


plot(periodogram, col="red", type="o",
xlab="Fourier Frequency Index", xlim=c(0,25),
ylab="Periodogram",
main="Periodogram derived from 'fft'")
The plot reads as follows: a frequency index of 10 means that there are ten full cycles in the data set; the duration of one cycle is therefore N/10, which is about 365 days, i.e. one year.
A second possibility to find the frequency distribution is the spectrum function.

# The same thing, this time using spectrum function


s <- spectrum(TempAirC, taper=0, detrend=FALSE, col="red",
main="Spectral Density")

# this time with log scale


plot(log(s$spec) ~ s$freq, col="red", type="o",
xlab="Fourier Frequency", xlim=c(0.0, 0.005),
ylab="Log(Periodogram)",
main="Periodogram from 'spectrum'")

cat("Max frequency: ")


maxfreq <- s$freq[ which(max(s$spec) == s$spec) ]
print(maxfreq)
# Period will be 1/frequency:
cat("Corresponding period\n")
print(1/maxfreq)
Please note that the frequency refers to the whole data set (i.e. 3652 points) and not to a year, a day or the defined time step.

# restore old graphics parameter


par(oldpar)

Next, we can use a different approach with a different scaling. The base period is now 365 days, i.e. a frequency of 1 means one cycle per year.
air = read.csv("http://www.hydrology.uni-kiel.de/~schorsch/air_temp.csv")
airtemp = ts(air$T_air, start=c(1989,1), freq=365)
spec.pgram(airtemp, xlim=c(0,10))

# draw lines for better visibility


abline(v=1:10,col="red")

To compute the residuals, we use the information from spectral analysis to create a linear
model.

x <- (1:3652)/365
summary(lm(air$T_air ~ sin(2*pi*x)+cos(2*pi*x)+ sin(4*pi*x)
+cos(4*pi*x) + sin(6*pi*x)+cos(6*pi*x)+x))
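A sketch of how the deseasonalised residuals of this harmonic regression could be extracted for further use (same terms as above):
x   <- (1:3652)/365
fit <- lm(air$T_air ~ sin(2*pi*x) + cos(2*pi*x) +
                      sin(4*pi*x) + cos(4*pi*x) +
                      sin(6*pi*x) + cos(6*pi*x) + x)
deseason <- residuals(fit)     # what the harmonics and the trend do not explain
plot(deseason, type = "l")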

46: analyse the periodogram of the lake water level before and after
the stl analysis

11.8 Sample data set for TS analysis


As a final example you can analyze the water quality data set of the Kielstau catchment, our
UNESCO-Ecohydrology demo site.
The sampling was carried out on a daily basis with an automatic sampler (daily sample) and on a weekly basis (manual sample, German: Schöpfprobe). The water of the automatic sampler was not cooled; the collected samples were taken to the lab once a week, at which time the manual sample was taken. The parameters of the data set consist of water quality indicators (nutrients), climate variables (temperature, rain) and hydrologic variables (discharge).
Possible questions about the data set are:
what is the auto- or cross-correlation of the variables (i.e. how stable is the system, does it react very fast, ...)?
what is the relation between the variables (correlation analysis)? Is this relation independent of other variables (summer/winter)?
the values for the manual and automatic sampling should be quite similar. Is this true?
does the climate influence the nutrient contents (temperature, precipitation)?
do the hydrologic variables influence the nutrient contents?

x select a statistical question, present the results (max. 5 figures)


Field Content
Datum_Jkorrekt Date, readable
Datum Date, numeric
week Week
Month Month
Year Year
NH4_N NH4-N sampler
S_NH4_N NH4-N manual sample
NO3 NO3 sampler
NO3_N NO3-N sampler
S_NO3_N NO3-N manual sample
PO4_P PO4-P sampler
S_PO4_P PO4-P manual sample
Ptot Total P sampler
S_Ptot Total P manual sample
Chlorid Chloride sampler
S_Chlorid Chloride manual sample
Sulfat Sulfate sampler
S_Sulfat Sulfate manual sample
Filt_vol Filtered volume
Sed Sediment
Q Discharge
W Water level
S_Watertemp Water temperature, manual sample
Quality_level Quality level
CloudCover Cloud cover
REL_HUMIDITY Relative humidity
VAPOURPRESSURE Vapour pressure
AIRTEMP Air temperature
AIR_PRESSURE Air pressure
WINDSPEED Wind speed
TEMP_MIN_SOIL Minimum temperature near the soil surface
AIRTEMP_MINIMUM Minimum air temperature
AIRTEMP_MAXIMUM Maximum air temperature
WIND_MAXIMUM Maximum wind speed
PREC_INDEX Precipitation index
PRECIPITATION Precipitation
SUNSHINE Sunshine duration
SNOWDEPTH Snow depth
S_Index Distance to manual sample
S_Number Number of the manual sample
reverse Reverse distance to manual sample
Summer Summer yes/no
12 Practical Exercises
12.1 Tasks
The central question in the first units of the course is:
Has the climate of the Hamburg station changed since measurement began?
The question can be divided, for example, into the following sub-questions or sub-tasks:
Comparing winter precipitation intensities
Has the intensity of (winter) precipitation during the years 1959-89 changed in
comparison to the years 1929-59?
Are trends identifiable in annual mean, minimum and maximum temperature (linear
regression with time as the x-axis)?
Has the difference between summer and winter temperatures changed?

12.1.1 Pivot Tables


The following examples are organized according to level of difficulty.
Calculation of the mean and sum of annual temperatures (pivot table with year as outline
variable, which must be created from the date field).
Calculation of annual and monthly means or sums (pivot table with year and month as
outline variables)
Calculation of mean daily variation (calculation of the difference between minimum and
maximum, followed by pivot table)
Creation of cumulative curves of precipitation, calculation of the mean cumulative curve via
the data storage record.
Calculation of the sum of temperatures within a given range (temperatures above 5 °C, obtained via the if function).
Calculation of summer and winter precipitation (creation of a binary (0/1) variable for
summer and winter months using the if-function, followed by a pivot table with year and
the binary variable as outline variables).
Calculation of the onset of the vegetation period, defined as the day with a temperature
sum >200, coding with binary variables, extraction of the day-number using a pivot table.
Analysis of the vegetation period, defined by the time from temperature sum >200 until
11/01 (trees), calculation of the precipitation during the vegetation period.

12.1.2 Regression Line


All previously created pivot tables can be used for trend analysis. It is possible to vary the
time period of the analysis, e.g. the last 5, 10, and 30 years.

12.1.3 Database Functions


Selection of the time of year
Number of summer days or days with frost (selection with temperature < 0, then count using a pivot table)
Mean precipitation intensities (selection of all days with precipitation > 0, then calculate the mean value using a pivot table)
Calculation of the onset of the vegetation period, selection of day using the filter function

12.1.4 Frequency Analyses


Analysis of rain duration: definition of a rain period as consecutive days with precipitation > 0, create an if formula (add), if necessary distribute into multiple columns, calculate the mean duration using a crosstab (without day 0!), and obtain the frequency.
13 Applied Exercise
Use the Hamburg climate dataset to compare the climate of the years 1950-1980 with 1981-
2010. The data set is already converted to R format, please load the workspace
climate.rdata to avoid formatting and conversion problems.
Select one variable and create a figure of 800x1200 pixels with the following contents:
a plot of the original data
a plot of annual, summer and winter mean values,
a boxplot of decadal values (use as.integer to create the decade factor)
a violinplot of the two periods 1950-1980 and 1981-2010
a lineplot of the monthly means or sums for the two periods 1950-1980 and 1981-
2010
a boxplot of the daily values as function of period and month
put everything together in one figure of 800x1200 pixels, send us the result by email and have a nice Christmas :-)

Hints
prepare the figures step by step
use aggregate to calculate the annual and monthly summaries
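A minimal sketch of the figure scaffolding (the file name and the panel layout are placeholders; fill in one plot command per panel):
png("climate_comparison.png", width = 800, height = 1200)
par(mfrow = c(6, 1))          # six panels, one above the other
# ... one plot/boxplot command per panel ...
dev.off()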

The following variables should be analyzed:

Cloud_Cover
RelHum
Mean_Temp
Airpressure
Min_Temp_5cm
Min_Temp
Max_Temp
prec
sunshine
snowdepth

If you have some time left:


47: Use ANOVA to compare the two periods

48: Analyse the slope of the different variables. Is there a significant increase?
14 Solutions
Solution 2:
Climate$Year_fac = as.factor(Climate$Year)
Climate$Month_fac = as.factor(Climate$Month)

Solution (summer/winter variable):

First Version:
Climate$Summer = 0
Climate$Summer[Climate$Month>5 & Climate$Month<10]=1

An alternative Version of the first command:


Climate$Summer[!(Climate$Month>5 & Climate$Month<10)]=0
The result is a numeric variable

Second Version:
Climate$Summer = (Climate$Month>5) & (Climate$Month<10)
The result is a boolean variable

Solution 10:
plot(Mean_Temp ~ Date, type="l")
lines(Max_Temp ~ Date, type="l", col="red")
lines(Min_Temp ~ Date, type = "l", col="blue")

Solution 12:
m2 = (Max_Temp+Min_Temp)/2
scatterplot(Mean_Temp ~ m2)
scatterplot(Mean_Temp ~ m2| Year_fac)

Solution (monthly means of the two periods):


attach(Climate_original)
from1950=Climate_original[Year>1949 & Year <1981,]
from1981=Climate_original[Year>1980 & Year <2011,]
detach(Climate_original)
t1950 = aggregate(x = from1950$Mean_Temp, by =
list(from1950$Month),FUN = mean, simplify = TRUE)
t1981 = aggregate(x = from1981$Mean_Temp, by =
list(from1981$Month),FUN = mean, simplify = TRUE)
ymax=max(t1981$x,t1950$x)
ymin=min(t1981$x,t1950$x)
plot(t1950,ylim=c(ymin, ymax))
lines(t1981)
scatterplotMatrix(~ Mean_Temp + Max_Temp + Mean_RelHum + Prec + Sunshine_h |Summer)

Solution 37:

Counting the changed land use:


check=lu07==lu87
How much forest disappeared between 1987 and 2007 at elevations above 2000 m?

Delete all elevations < 1000:

ue1000 = dem>1000
t2 = ue1000*dem
spplot(t2)
ue1000b = (dem>1000) * dem

forest87=lu87==1
forest07=lu07==1
ue1000 =dem>1000
forest87a=forest87*ue1000
forest07a=forest07*ue1000
# decrease: forest in 1987 (=1), no forest in 2007 (=0)
diff87_07 = (forest87a ==1) & (forest07a == 0)
spplot(diff87_07)
summary(diff87_07)
Cells: 770875
NAs : 378939
Mode "logical"
FALSE "384320"
TRUE "7616" Decrease
NA's "378939"
# increase: no forest in 1987 (=0), forest in 2007 (=1)
diff07_87 = (forest87a ==0) & (forest07a == 1)
spplot(diff07_87)
summary(diff07_87)
Cells: 770875
NAs : 378943
Mode "logical"
FALSE "370912"
TRUE "21020" increase
NA's "378943"
# any spatial patterns?
diff= diff87_07-diff07_87
spplot(diff)

Solution 38:
boxplot (H_lake ~ months)
boxplot (H_GW ~ months)
boxplot (H_lake ~ years)
boxplot (H_GW ~ years)

Solution 39:
aggregate(H_lake, list(n = months), mean)

Solution 43:

gw = ts(H_GW, start=c(1989,1),freq=365)
plot(stl(gw,s.window="periodic"))

Solution 44:
ccf(H_ERLwet, H_lake, lag.max=365, plot=TRUE)
ccf(H_ERLwet, H_GW, lag.max=365, plot=TRUE)

Solution 46:
