Ecological Data with R
Georg Hörmann
Institute for Natural Resource
Conservation
ghoermann@hydrology.uni-kiel.de
Ingmar Unkel
Institute for Ecosystem Research
iunkel@ecology.uni-kiel.de
Christian-Albrechts-Universität zu Kiel
Author's Copyright
This book, or whatever one chooses to call it, is subject to the GNU license (GPL, full details
available on every good search engine). It may be further distributed as long as no money is
requested or charged for it.
Table of Contents
1 Introduction............................................................................................7
1.1 Excursus: Freedom for Software: The Linux or Microsoft Question..................................8
5 Bivariate Statistics.................................................................................52
5.1 Pearson's Correlation Coefficient..........................................................................................53
5.2 Correlograms (correlation matrices)....................................................................................58
5.3 Classical Linear Regression.....................................................................................................60
5.3.1 Analyzing the Residuals..................................................................................................61
6 Univariate Statistics..............................................................................66
6.1 Student's t Test.........................................................................................................................66
6.2 Welch's t Test............................................................................................................................68
6.3 F-Test..........................................................................................................................................69
6.4 χ²-Test (Goodness of fit test)...............................................................................................69
8 Cluster Analysis.....................................................................................79
8.1 Measures of distance................................................................................................................79
8.2 Agglomerative hierarchical clustering.................................................................................82
8.2.1 Linkage methods..............................................................................................................82
8.2.2 Clustering Algorithm.......................................................................................................83
8.2.3 Clustering in R..................................................................................................................84
8.3 K-means clustering....................................................................................................................85
8.4 Chapter exercises.....................................................................................................................87
8.5 Problems of cluster analysis...................................................................................................88
8.6 R code library for cluster analysis.........................................................................................89
9 Ordination.............................................................................................90
9.1 Principal Component Analysis (PCA)....................................................................................90
9.1.1 The principle of PCA explained......................................................................................90
9.1.2 PCA in R..............................................................................................................................93
9.1.2.1 Selecting the number of components to extract................................................94
9.1.3 PCA exercises....................................................................................................................94
9.1.4 Problems of PCA and possible alternatives..................................................................96
9.2 Multidimensional scaling (MDS)............................................................................................96
9.2.1 Principle of an NMDS algorithm.......................................................................96
9.2.2 NMDS in R..........................................................................................................................97
9.2.3 NMDS Exercises................................................................................................................99
9.2.4 Considerations and problems of NMDS........................................................................99
9.3 R code library for ordination................................................................................................101
10 Spatial Data........................................................................................102
10.1 First example.........................................................................................................................102
10.2 Point Data..............................................................................................................................103
10.2.1 Bubble plots...................................................................................................................103
10.3 Raster data.............................................................................................................................104
10.4 Vector Data............................................................................................................................105
10.5 Working with your own maps............................................................................................107
12 Practical Exercises.............................................................................120
12.1 Tasks.......................................................................................................................................120
12.1.1 Pivot Tables...................................................................................................................120
12.1.2 Regression Line.............................................................................................................121
12.1.3 Database Functions......................................................................................................121
12.1.4 Frequency Analyses.....................................................................................................121
13 Applied Analysis.................................................................................122
14 Solutions............................................................................................124
Illustration Index
Figure 1: Workflow of an analysis.....................................................................................................11
Figure 2: Installation of Rcmdr..........................................................................................................13
Figure 3: Interface of Rcmdr..............................................................................................................13
Figure 4: After a successful import of the Climate data base........................................................14
Figure 5: File menu of Rcmdr, used to save commands and data..................................................15
Figure 6: Rstudio user interface.........................................................................................................16
Figure 7: Source of the climate data set for Hamburg-Fuhlsbüttel.............................................17
Figure 8: Contents of the climate archive data...............................................................................17
Figure 9: Common problems in spreadsheet files..........................................................................19
Figure 10: Structure of our climate data base (Hamburg).............................................................19
Figure 11: Import of data....................................................................................................................20
Figure 12: Settings for an import of the climate data set from the clipboard...........................20
Figure 13: Result of a data import.....................................................................................................21
Figure 14: Data import with RStudio................................................................................................22
Figure 15: Control of variable type...................................................................................................23
Figure 16: Frequent problem with a conversion of mixed variables...........................................24
Figure 17: Example of a good and bad database structure for daily time series.......................30
Figure 18: Example of a good and bad database structure for lab data......................................30
Figure 19: Structure of the "molten" data set.................................................................................32
Figure 20: Combining figures with the split.screen() command..................................................38
Figure 21: Layout frame......................................................................................................................39
Figure 22: Result of the layout commands.......................................................................................39
Figure 23: Common display of hydrological simulation results...................................................41
Figure 24: Scatterplot of air temperatures with annual grouping...............................................43
Figure 25: a Probability density function f(x) and b cumulative distribution function F(x) of
a χ² distribution with different values for the degrees of freedom...........................65
Figure 26: a Probability density function f(x) and b cumulative distribution function F(x) of
a Student's t distribution with different values for the degrees of freedom............66
Figure 27: Illustration of Jaccard distance.......................................................................................80
Figure 28: Illustration of Euclidean and Manhattan distance......................................................81
Figure 29: Illustration of different ways to determine the distance between two clusters, for
example by single-linkage (A) or complete-linkage (B)................................................83
Figure 30: Illustration of the agglomerative hierarchical clustering algorithm.......................84
Figure 31: Illustration of the k-means clustering algorithm........................................................86
Figure 32: Illustration of the PCA principles...................................................................................91
Figure 33: Graphic result of a PCA.....................................................................................................92
Figure 34: Leptograpsus variegatus..................................................................................................94
Figure 35: Summary of one variable (Mean Temperature).........................................................123
1 Introduction
There are many books on statistics, difficult to digest and with a tendency to reprint
formulas. There are still more books for every possible type of software, in which
formatting and graphic creation is placidly explained. What's lacking is a compilation of the
methods and tools used daily in practice.
This book is not meant to replace statistical textbooks and programming handbooks, but is
rather meant as a summary for ecologists containing numerous practical tips which
otherwise would have to be gathered from many different sources. It was conceived as an
accompaniment to a course at the Ecology Center of the University of Kiel, in which
students of geography, biology, and agricultural science are introduced to analyzing data
records.
The students have mostly had an introduction to statistics and a basic course in data
processing. The scope of these courses is generally limited, and the connection between the
two subjects has usually not been made, although this knowledge is fundamental and, by
the time of the diploma thesis at the latest, a prerequisite.
The aim of this book as well as that of our course is to give students an overview of the
methods and tools used to analyze data records based on measurements and modeling. The
structure of this book is built on the work flow used in the analysis of data.
In the review of tools we've made a point of emphasizing open-source software. This is
partly for financial reasons: small institutions and engineering firms often cannot even
afford large and expensive packages, the range of functions of which moreover are often
oriented more toward the needs of bookkeepers than those of scientists. Software from the
realm of the natural sciences may often be arduous to learn but is in return more flexible
and productive in the long run.
The data sets for this course are available on a website in the internal e-learning system of
Kiel University (OLAT, https://www.uni-kiel.de/lms/dmz/) where example data and files as
well as the latest version of this book are available for download. Current links to the
recommended software can of course also be found there.
Presuppositions: this book doesn't provide an introduction to the various programs; rather,
it presupposes basic knowledge of user software and operating systems. We cover the
things which the user needs in practical situations but which were never mentioned in the
respective introductory courses.
The authors of this book have seen everything that can go right or wrong. They also take
the point of view that irony, amusement, and the regular enjoyment of Monty Python films
are fundamental requirements for survival when doing scientific work.
Comments on typography
Warnings are displayed like this. They point out everyday (oftentimes
banal) mistakes that can set off an hour-long search for the
cause.
1: Exercises and homework for courses are marked like this
1.1 Excursus: Freedom for Software: The Linux or Microsoft Question
Word may have gotten around by now that we don't live in the best of all worlds; for one
thing, PCs and software would be freely available if we did. On the one hand we as users
want to pay as little as possible, but at the same time the programmers of our software can't
live on air and appreciation alone, at least not for long. The completely normal capitalistic
model has software sold like any other merchandise and the people who design and build it
paid like any other worker: that's the Microsoft version.
The other side views things somewhat more idealistically: software is a human right and
should flow freely in the free stream of ideas. Users and programmers constitute an organic
unit and continually develop the product together. The programmer earns his/her money
not (only) through the software but through the related rendering of services. There are
also people who develop software out of idealism, and for whom an elegant piece of software
affords the same pleasure as a good concert: that's the Linux version. It's significantly
more prevalent in the academic world, because many programs developed with
governmental financial support are passed around free of charge.
Why Linux?
Linux is available freely or inexpensively, even for commercial use.
Linux is a modular operating system, so unused functions take up no storage space
and can't crash. It's thus possible, for example, when using systems for data logging
or when running a pure database system, to avoid graphical user interfaces altogether.
Linux systems are also serviceable remotely through slow connections, no ifs or
buts. In an emergency, just about the whole system can be reinstalled online.
Linux systems run more stably and are less demanding on hardware.
Linux systems are fully documented, including all interfaces.
Along with the technical arguments there's also the current financial situation of schools
and learning institutions of various levels, and also of many smaller firms. When the
operating system and the office suite together are more expensive than the computer on
which they're installed, many consider whether they shouldn't just buy two PCs with Linux
and LibreOffice. Then come the exorbitant prices for software in the technical/academic
world. When simulation software, geographical information systems (GIS), statistics
packages, and databases are all needed, then the price of the hardware becomes negligible.
Worse still, the further development of office suites, along with expensive updates for the
technical/scientific versions, has contributed practically nothing new.
We have therefore decided in favor of a dual track: we discuss solutions to problems with
standard packages that are also applicable to open-source software (Excel, LibreOffice) and,
concerning more expensive special software for statistics, graphics, and data processing,
elaborate more upon free software.
[Figure 1: Workflow of an analysis: import data; check data types, structure, missing values and extreme values; compute date/time; advanced statistics]
2.1 Installation
2.1.2.1 Rcmdr
To install the Rcmdr interface, select Packages -> Install packages from the R GUI.
If you have never worked with packages before, R will ask you which mirror server it should
use; select one close to you or in the same network (e.g. Göttingen for the German
universities). Next, select all packages of Rcmdr as shown in Fig. 2 and wait for the
installation to finish. After the installation you should first start the interface with
library(Rcmdr)
Before it starts it will load other additional packages from the internet. After this process,
Rcmdr will come up (Fig. 3) and is ready for work.
For the first steps in R we recommend Rcmdr, because it helps you to import data files and
builds commands for you. If you are more familiar with R, you can switch to RStudio, which
is the more modern GUI.
We will use the successful import of our data to R (Fig. 4) as an introduction to the basic
philosophy of Rcmdr. The program window consists of three parts: the script window, the
output window and the message window.
The script window contains the commands sent by Rcmdr to R. This is the easiest way
to study how R works, because Rcmdr translates everything you select with the
mouse in the interface into proper R code. You can also type in any command
manually. To submit a command or a marked block, click the submit
button.
The output window shows the results of the operation you just submitted. If you type
in the command 3+4 in the script window and submit the line, R confirms the
command line and prints out the result (7).
The message window shows you useful information, e.g. the size of the database we
just imported.
Much of the power of R comes from a clever combination of the script windows and the file
menu shown in Fig. 5.
The commands dealing with the script file save or load the commands contained in a
simple text file. This means that all commands you or Rcmdr issue in one session can be
saved in a script file for later use. If, for example, you put together a really complex figure,
you can save the script and repeat it whenever you need it with a different data set.
The same procedure can be used for the internal memory, called the workspace in R. It
contains all variables in memory; you can save it before you quit your session and R reloads
it automatically next time. If you want to reload it manually, you can use the original R
interface or load it from the data menu.
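The manual version of this save/reload cycle can be sketched with a few base-R commands (the file name mysession.RData is just an example):

```r
# a minimal sketch of saving and restoring the workspace by hand
x <- 42
save.image("mysession.RData")   # write all variables in memory to a file
rm(list = ls())                 # clear the workspace
load("mysession.RData")         # restore the saved variables
x                               # x is available again
```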
2.1.2.2 Rstudio
The RStudio GUI (http://www.rstudio.com) has to be installed like any other Windows
program (Fig. 6).
For Ubuntu Linux you also have to download and install the software from the website; it is
not part of the software repository.
2.2 The Hamburg climate data set
The data set we use in our course is a climate data set from the Hamburg station, starting in
the year 1891. The data set is part of the freely available climate data, and you can download
the latest version from the (horribly structured) website of the German Weather Service at
http://www.dwd.de/ (see Fig. 7 for a screenshot of the download location). The contents of
the download are shown in Fig. 8. You only need the file marked with the red circle; all
other files are documentation (in German). Rename the data file to something reasonable
like climate.txt (see Fig. 10). Do not worry about the German description of the
variables; we will change them immediately after the import.
2.3 Import of data
For a good start of this lesson we need the data base and an interface to the R program.
The structure of the data base is shown in Fig. 10, but you can use any climate data set.
Fig. 9 shows some common issues in spreadsheet files which cause problems later on in
R. First, check that there is only one rectangular matrix in a worksheet. Remove all old
intermediate steps like the ones shown in columns E-G in Fig. 9. Check also the lower end of
the spreadsheet for sums and other grouping lines. Second, check the variable/column
names. They should contain only good old-fashioned ASCII characters: no spaces (Fig. 9),
umlauts, operator symbols or other characters with a special meaning (e.g. /, (, )).
Third, check the columns for text, especially text used to define missing values ("-" etc.).
2.3.2 Import of text data (csv, ASCII) with Rcmdr
The first step of an analysis is to import the data. Usually, the data set is already available as
a spreadsheet or text file and you only need the commands shown in Fig. 11.
The data set must have a rectangular form without empty rows or
columns
If you import the data set from the clipboard, you should take care to properly fill out the
fields marked in red in Fig. 12, especially the decimal-point character and the field
separator.
Import of data in CSV format is also available in Rstudio. Figure 14 shows the import
function. The available options are the same as in Rcmdr.
Fig. 16 shows a frequent problem: if the imported file contains not only numbers but also
text (see left side, the text "Missing"), the whole column is converted to a factor variable,
i.e. the variable cannot be used for computation, only for classification.
2.4 Working with variables
The next commands are hard to understand at first sight, but they are the source of the
unmatched elegance and flexibility of R.
Climate[-1,]
All values except the first line
Climate[1:10,]
The first ten lines
Climate[1:10, c(2:4,7,9)]
The first ten lines of columns 2-4, 7 and 9. The expression c() creates a vector; most
commands accept it as input.
Climate[Climate$AirTemp_Max>35,]
Get only data sets with AirTemp_Max>35
Climate$AirTemp_Max[(Climate$AirTemp_Max>19 &
Climate$AirTemp_Max<20)]
Get values between 19 AND 20 of AirTemp_Max. The logical OR condition is handled by the
operator |.
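The difference between AND and OR can be checked with a small stand-alone vector (the values are made up for illustration):

```r
# AND keeps values satisfying both conditions, OR keeps values satisfying either
x <- c(5, 12, 19.5, 25)
x[x > 10 & x < 20]   # AND: both conditions must hold
x[x < 10 | x > 20]   # OR: one condition is enough
```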
If you want to keep the result and save it in a variable, you can use the = operator.
climax = Climate[(Climate$AirTemp_Max>19 &
Climate$AirTemp_Max<20),]
Creates a new variable climax with the contents of the selection.
If you do not want to type the name of the data matrix each time you need a variable, you
can use
attach(Climate)
to make variables inside a data matrix visible. After the attach command, the command
AirTemp_Max[(AirTemp_Max>19 & AirTemp_Max<20)]
will list the same result as the command above with full names. Another (politically correct)
method to access variables inside a data frame is to use the with() function:
with(Climate, AirTemp_Max[(AirTemp_Max>19 & AirTemp_Max<20)])
2.4.3 Coding date and time
The date format in Germany is usually DD.MM.YYYY; in international publications it is written in
ANSI form as YYYY-MM-DD. These text-formatted dates are usually converted into numbers.
Normally, the day count is the integer part of the coded number, and the decimal fraction
represents the time of day as the fractional day since midnight. What makes the handling of date
values difficult is that different programs use different base values for the day count: ANSI, for
example, uses 1601-01-01 as day no. 1, while some spreadsheets use 1900-01-01 on PC and 1904-01-01
on Mac computers. It is therefore highly recommended to use the text format for data exchange.
The commands are explained in chapter 11.3.1 on page 110.
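The base-value problem can be reproduced directly in R: as.Date() accepts a day count together with an explicit origin (the serial number 40000 is an arbitrary example):

```r
# the same day count decodes to different dates depending on the assumed origin
as.Date(40000, origin = "1899-12-30")  # Windows spreadsheet convention
as.Date(40000, origin = "1904-01-01")  # old Mac spreadsheet convention
```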
For our climate data set we need real dates, so we have to convert the input to internal date
values.
Climate$Meas_Date=as.character(Climate$Meas_Date)
A conversion of the integer variable to text makes it easier to create the date.
Climate$Date= as.Date(Climate$Meas_Date, "%Y%m%d")
Convert the text to a real date. See the help for a complete list of all format options.
Climate$Year = format(Climate$Date, "%Y")
Extract the years from the date; we need this information later for annual values.
# check data type!
Climate$Year = as.numeric(Climate$Year)
The format function returns text, we convert it back to a number.
Climate$Month = format(Climate$Date, "%m")
Climate$Month = as.numeric(Climate$Month)
Climate$Dayno = Climate$Date - as.Date(paste(Climate$Year,"/1/1"), "%Y /%m/%d") + 1
In ecology we frequently need the day number from 1 to 365. We get it in R by subtracting
the date of 1 January from the current date.
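A quick stand-alone check of the day-number arithmetic (1 March in the leap year 2020 should be day 31 + 29 + 1 = 61):

```r
# day number of 1 March 2020, counted from 1 January = day 1
d <- as.Date("2020-03-01")
dayno <- as.numeric(d - as.Date("2020-01-01")) + 1
dayno   # 61
```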
2.5 Simple graphics
R has three different graphic subsystems. For a first overview we recommend the new
ggplot2 library.
library(ggplot2)
qplot(Date,AirTemp_Mean,data=Climate)
In case you do not specify the type of figure you want, ggplot2 makes a guess.
qplot(Date,AirTemp_Mean,data=Climate,geom="line")
The geom parameter defines the type of figure you want to have. In this case line is a good
choice.
qplot(Year,AirTemp_Mean,data=Climate,geom="boxplot")
qplot(as.factor(Year),AirTemp_Mean,data=Climate,geom="boxplot")
Some commands cannot handle all data types; here we have to convert the numeric variable
Year to a factor variable.
qplot(as.factor(Month),AirTemp_Mean,data=Climate,geom="boxplot")
Boxplots are not always the best method to display data. If the distribution could be
clustered, the jitter type is a good alternative. It displays all points of a data set.
qplot(as.factor(Month),AirTemp_Mean,data=Climate,geom="jitter")
One of the advantages of the qplot command is that you can use colours and matrix plots
out of the box.
qplot(as.factor(Month),AirTemp_Mean,data=Climate,geom="jitter",col=Year)
qplot(as.factor(Year),AirTemp_Mean,data=Climate,geom="jitter",col=Month)
It can be quite useful to plot data sets in annual or monthly subplots; with the facets
option you can plot one- or two-dimensional matrices of plots.
qplot(as.factor(Year),AirTemp_Mean,data=Climate,geom="jitter",facets=Month ~ .)
qplot(as.factor(Year),AirTemp_Mean,data=Climate,geom="line",facets=Month ~ .)
qplot(as.factor(Dayno),AirTemp_Mean,data=Climate,geom="jitter",col=Month)
qplot(as.factor(Dayno),AirTemp_Mean,data=Climate,geom="jitter",col=Year)
2.6 Tasks
1: Calculate a variable Climate$Summer where summer = 1 and winter = 0
2: Plot the summer and winter temperatures in a boxplot
Climate$Year = as.numeric(as.character(Climate$Year))
If you forget this not really obvious step, you get the index of the factor level, not the value of the variable.
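A minimal demonstration of the trap with a made-up factor:

```r
# converting a factor directly to numeric returns the level codes, not the values
f <- factor(c("1998", "1999", "2000"))
as.numeric(f)                 # indices of the factor levels: 1 2 3
as.numeric(as.character(f))   # the actual values: 1998 1999 2000
```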
3.1 Data organization and data safeguarding
Many data evaluations fail already in the preparatory stages, and with them often a
hopeful junior scientist. It's one of the most moving scenes in the life of a scientist when a
USB stick or notebook computer with freshly processed data (or data deleted in the logger)
sinks in a swamp, ditch, or lake.
Taking heed of our own painful experience, we've placed a chapter before the actual focal
point of the book in which data organization and data safeguarding are covered. Along
with it there's also a short overview of the vexing set of problems associated with various
date and time formats when working with time series.
3.2 Basic data management with dplyr
In short, dplyr offers you the basic functions of data management:
filter: select parts of the data set defined by filter conditions
select: select variables (columns) of a data set
arrange: sort data sets
mutate: change values, calculate and create new variables
group: divide the data set into groups (e.g. years, months)
summarise: calculate means, sums etc. for groups
join: combine two data bases
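How the verbs work together can be sketched with a tiny made-up data frame (the column names only mimic our climate data set):

```r
library(dplyr)

# hypothetical miniature climate table
clim <- data.frame(Year = c(2001, 2001, 2002, 2002),
                   AirTemp_Mean = c(6.1, 17.4, 5.9, 18.2))
hot <- filter(clim, AirTemp_Mean > 10)    # keep only warm days
hot <- select(hot, Year, AirTemp_Mean)    # keep two columns
hot <- arrange(hot, desc(AirTemp_Mean))   # hottest first
hot
```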
The use of the filter function is straightforward:
t5 = filter(Climate, AirTemp_Mean>5)
The select function also works as expected with names and numbers
t6 = select(t5, Date,Year,Month,Dayno,AirTemp_Mean)
You can also use the index of columns, but the version with names is more readable and
avoids problems if columns are deleted
t6 = select(t5, c(18:21,4))
You can use the arrange function to select the 100 hottest days in the data set and look for
signs of global change
t7=arrange(t6,desc(AirTemp_Mean))
5: Select the 100 hottest days and plot the temporal distribution as a histogram in groups of 10 years
6: Select the 100 coldest days and plot the temporal distribution as a histogram in groups of 10 years
You can easily use any R function to change values, but the politically correct way is to use
the mutate function. A common method in meteorology is to use the average of minimum
and maximum temperature as a replacement for the mean temperature.
t11 = mutate(Climate,New_Mean=(AirTemp_Max+AirTemp_Min)/2)
In pure R you get the same result with
t11$New_Mean = (Climate$AirTemp_Max+Climate$AirTemp_Min)/2
The most common application of dplyr is the calculation of means, sums etc., e.g. for
annual and monthly values. The first step is to create groups:
clim_group=group_by(Climate, Year)
With this grouped data set you can use any function to calculate new values based on the
grouping
airtemp=summarise(clim_group,mean=mean(AirTemp_Mean),
median=median(AirTemp_Mean))
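Grouping works the same way with more than one variable; a self-contained sketch with made-up values:

```r
library(dplyr)

# hypothetical data: two years with two months each
clim <- data.frame(Year  = c(2001, 2001, 2002, 2002),
                   Month = c(1, 2, 1, 2),
                   AirTemp_Mean = c(0.5, 1.5, 2.0, 4.0))
clim_group <- group_by(clim, Year, Month)
monthly <- summarise(clim_group, mean = mean(AirTemp_Mean))
monthly   # one row per Year/Month combination
```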
Now it's easy to create a quite complex figure with a simple command. Please note the "+"
as the last character in the line; we need it to continue the graphic command.
qplot(Dayno,value,data=Clim_melt) +
facet_grid(variable ~ .,scales="free")
3.4 Merging data bases
To demonstrate how to merge two data bases we use a different data set. The
administration of the Plön district supervises a monitoring program of all lakes in the Plön
district (Edith Reck-Mieth). We have a data base of the annual measurements of chemical
properties and a data base of the static properties of the lakes like depth, area etc.
First, we read the two data bases and check content and structure.
chemie <- read.csv("chemie.csv", sep=";", dec=",")
stations <- read.csv("stations.csv", sep=";", dec=",")
In the second step we merge the two data bases. We can use the old style command
# join the two data bases
chem_all=merge(chemie,stations,by.x="Scode")
In dplyr syntax, the same result is produced by
chem_all2=inner_join(chemie,stations,by="Scode")
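Both variants can be checked with two toy tables; the key column Scode mirrors the lake data, the values are invented:

```r
library(dplyr)

# two hypothetical tables sharing the key column Scode
chem <- data.frame(Scode = c("A", "B", "C"), NO3 = c(1.2, 3.4, 0.8))
stat <- data.frame(Scode = c("A", "B"), Depth = c(12, 25))
merge(chem, stat, by = "Scode")        # base R: only matching rows survive
inner_join(chem, stat, by = "Scode")   # dplyr: same result
```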
The lattice library contains a lot of useful chart types, e.g. dotplots
library(lattice)
dotplot(Mean_Temp~Year_fac)
# also available in ggplot2
qplot(Year_fac,AirTemp_Mean,data=clim2000)
qplot(Year_fac,AirTemp_Mean,data=clim2000,geom="jitter")
A scatterplot is a version of a line plot with symbols instead of lines. It is a very common
type, used later for regression analysis. For a ggplot2 version of these figures see 4.6.1.
plot(AirTemp_Max, AirTemp_Min)
abline(0,1)
abline(0,0)
abline(lm(AirTemp_Min ~ AirTemp_Max), col="red")
lines(AirTemp_Min,AirTemp_Mean,col="green", type="p")
abline(lm(AirTemp_Max ~ AirTemp_Min), col="green")
There are also some new packages with more advanced functions. Try e.g.
library(car)
scatterplot(AirTemp_Max ~ AirTemp_Min | Year_fac)
For really big data sets the following functions can be quite useful:
library(IDPmisc)
iplot(AirTemp_Min, AirTemp_Max)
or
library(hexbin)
bin = hexbin( AirTemp_Min, AirTemp_Max,xbins=50)
plot(bin)
or
with(Climate,smoothScatter( AirTemp_Mean,AirTemp_Max))
or with ggplot2
qplot(data=Climate,AirTemp_Mean,AirTemp_Max,
geom="bin2d")
qplot(data=Climate,AirTemp_Mean,AirTemp_Max,
geom="hex")
If you do not like the boring blue colours, you can change them to rainbow patterns
qplot(data=Climate,AirTemp_Mean,AirTemp_Max)+
stat_bin2d(bins = 200)+
scale_fill_gradientn(limits=c(0,50), breaks=seq(0, 40, by=10),
colours=rainbow(4))
plot() opens a new figure; lines() adds a line to an existing figure; abline() draws a straight line.
4.4 Combined figures
For more complex, combined figures there are basically two choices in R: an easy-to-
understand matrix approach where all subfigures have the same size, and a complex
approach where you can place your figures freely on a grid.
4.4.1 Figure matrix
For a figure with 4 elements (2 rows, 2 columns) we write
par(mfrow=c(2,2))
plot(AirTemp_Mean ~ Date, type="l", col="red", main="Fig 1")
plot(AirTemp_Max ~ Date, type="l", col="red", main="Fig 2")
plot(Prec ~ Date, type="l", col="red", main="Fig 3")
plot(Hum_Rel ~ Date, type="l", col="red", main="Fig 4")
4.4.2 Nested screens
A similar effect is produced with
split.screen(c(2, 2))
screen(3)
plot(Prec ~ Date, type="l", col="red", main="Fig 3")
screen(1)
plot(AirTemp_Mean ~ Date, type="l", col="red", main="Fig 1")
screen(4)
plot(Hum_Rel ~ Date, type="l", col="red", main="Fig 4")
Here, screens can be addressed separately by their numbers. It is also possible to nest
screens: screen 2 is split into one row and two columns, which get the screen numbers 5 and 6.
split.screen( figs = c( 1, 2 ), screen = 2 )
screen(5)
plot(Prec ~ Date, type="l", col="red", main="Fig 5 inside 2")
screen(6)
plot(Sunshine ~ Date, type="l", col="red", main="Fig 6 inside 2")
close.screen(all=TRUE)
The result should look like Fig. 20.
[Fig. 20: four-panel screen layout with Fig 1 (Mean_Temp), Fig 2 (Max_Temp) and Fig 3 (Prec) plotted against Date]
http://gallery.r-enthusiasts.com/ (probably down)
A nice selection of what is possible with graphics functions of R
http://research.stowers-institute.org/efg/R/
Another selection with more basic figures
[Figure description fragment: 2nd row: two scatterplots with 1) min vs. max temperature and 2) the mean temperature vs. …]
Lattice is very well suited for the display of data sets with many (factor) variables, but the
syntax is different from normal figures and the display is not very flexible.
library(lattice)
First, let us start with some descriptive figures.
densityplot( ~ AirTemp_Max | Month_fac , data=Climate)
histogram( ~ AirTemp_Max | Month_fac , data=clim2000)
histogram( ~ AirTemp_Max+AirTemp_Min | Month_fac , data=clim2000)
Please note how the numeric variables (temperatures) and the factor variables are ordered.
All examples above print monthly plots of a temperature.
Scatterplots are very similar, only the definition of the variables is different:
xyplot(AirTemp_Mean ~ AirTemp_Max+AirTemp_Min | Month_fac ,
data=clim2000)
A very useful keyword is the grouping inside a figure.
xyplot(AirTemp_Mean ~ AirTemp_Max+AirTemp_Min, groups=Summer,
data=clim2000)
Here you can clearly see the difference between summer and winter values.
Another useful feature is the automatic addition of a legend.
xyplot(AirTemp_Mean ~ AirTemp_Max+AirTemp_Min, groups=Summer,
auto.key=T, data=clim2000)
xyplot(AirTemp_Mean ~ AirTemp_Max+AirTemp_Min, groups=Summer,
auto.key=(list(title="Summer?")), data=clim2000)
A combination of all these simple features makes it easy to get an overview of the dataset. In our
example it is quite apparent that something went wrong in the year 2007 (Fig. 23).
xyplot(AirTemp_Mean ~ AirTemp_Max+AirTemp_Min, groups=Year_fac,
auto.key=list(title="Year",columns=7), data=clim2000)
Figure 24: Scatterplot of air temperatures with annual grouping
4.5 Brushing up
4.5.3 Colors
Colors in all pictures can be referred to by number or text.
plot(Max_Temp, Min_Temp, col=2)
is the same as
plot(Max_Temp, Min_Temp, col="red")
A list of all available colors is printed with
colors()
If you want only shades of red
colors()[grep("red",colors())]
http://research.stowers-institute.org/efg/R/Color/Chart/
In-depth information about colors in R and science
4.5.4 Legend
In the basic graphics system, legends are not added automatically; you have to define
them separately, e.g.
plot(AirTemp_Max, AirTemp_Min)
lines(AirTemp_Min,AirTemp_Mean,col="green", type="p")
legend(20,-10, c("Max/Min", "Min/Mean"), col = c("black","green"), lty
= c(0,0), lwd=c(1,2), pch=c("o","o"), bty="n",merge = TRUE, bg =
'white' )
locator(1) # get the coordinates of a position in the figure
Again, the x/y coordinates of the legend are given in the units of the data set. If you want to set
the location with the mouse, you can use the following command
legend(locator(1), c("Max/Min", "Min/Mean"), col = c("black","green"),
lty = c(0,0), lwd=c(1,2), pch=c("o","o"), bty="n",merge = TRUE, bg =
'white' )
4.5.5 More than two axes
Each plot command sets the scales for the whole figure; the next plot command would
create a new figure. To avoid this, you have to create a new reference system in the same
figure.
First, we need more space on the right side of the plot and set margins for the second
y-axis.
par(mar=c(5,5,5,5))
plot(AirTemp_Mean ~ Date, type="l", col="red", yaxt="n", ylab="")
As the y-axis is not drawn (yaxt="n"), we add it manually
axis(2, pretty(c(min(AirTemp_Mean),max(AirTemp_Mean))), col="red")
and finally add a title for the left axis
mtext("Mean Temp", side=2, line=3, col="red")
Now comes the second data set. To avoid a new figure we need to set
par(new=T)
The next lines are quite similar, except that we draw the y-axis on the right side (4).
plot(Prec ~ Date, type="l", col="green", yaxt='n', ylab="")
axis(4, pretty(c(0,max(Prec))), col="green")
mtext("Precipitation", side=4, line=3, col="green")
library(ggplot2)
library(grid)
library(gridExtra)
p1 <- qplot(data=clim2000,x=Month_fac,facets= . ~ Year_fac,
y=AirTemp_Mean,geom="boxplot")
p2 <- qplot(data=clim2000,x=Month_fac,facets= . ~ Year_fac,
y=Prec,geom="boxplot")
p3 <- qplot(data=clim2000,x=Month_fac,facets= . ~ Year_fac,
y=Sunshine,geom="boxplot")
p4 = qplot(data=clim2000,x=Date,y=AirTemp_Mean,geom="line")
grid.arrange(p1, p2, p3, p4, ncol = 2, main = "Main title")
dev.off()
The multiplot function delivers nearly the same result.
Now we move up one step in the viewport hierarchy; all plot commands would now be printed on the
full page.
upViewport(1)
### define second plot area
vp2 <- viewport(x = 1, y = 0, height = 0.5, width = 0.5,
just = c("right", "bottom"), name = "lower right")
### enter vp2
pushViewport(vp2)
### show the plotting region (viewport extent)
### plot another plot
bw.lattice <- qplot(data=clim2000,x=Month_fac,
facets= . ~ Year_fac,y=Prec,geom="boxplot")
print(bw.lattice, newpage= FALSE)
### leave vp2
upViewport(1)
vp3 <- viewport(x = 0, y = 1, height = 0.5, width = 0.5,
just = c("left", "top"), name = "upper left")
pushViewport(vp3)
bw.lattice <- qplot(data=clim2000,x=Month_fac,
facets= . ~ Year_fac, y=Sunshine,geom="boxplot")
4.7 Saving figures
The easiest way to save figures produced with R is to copy them via the clipboard (copy
and paste) directly into your text or presentation, or to save them with File → Save as to
an image file. However, if you have more than one image, or if you have to produce the same image
over and over, it is better to save the figures automatically to a file. You can save figures in
different formats; below we show the commands to open a file in PDF, PNG and JPG format,
respectively.
pdf(file = "FDC.pdf", width=5, height=4, pointsize=1);
png("acf_catments.png", width=900) # dim in pixels
jpeg("test.jpg",width=600,height=300)
plot(AirTemp_Mean ~ Date, type="l", col="red", main="Fig 1")
All graphic devices are closed with the command
dev.off()
The recommended procedure is to develop and test a figure on screen and wrap it in a file
as soon as the results are as expected.
For the ggplot library you have to use
fig1 <- qplot(data=clim2000,x=Month_fac,y=AirTemp_Mean,geom="boxplot")
ggsave("fig1.png",width=3, height=3) # dim in inches by default; set units="cm" for centimetres
library(GGally)
ggpairs(t,columns=1:3)
You can also integrate density plots in ggpairs
ggpairs(t,columns=1:3,
upper = list(continuous = "density"),
lower = list(combo = "facetdensity"))
One of the most useful scatterplot versions is pairs, which can also print out the correlation
and the significance level. Unfortunately, pairs does not work with missing values in the
data set, so incomplete rows have to be removed first. This cleaning process often removes
half of the data set.
t2=t[complete.cases(t),]
pairs(t2[1:3], lower.panel=panel.smooth, upper.panel=panel.plot)
t2=t[complete.cases(t),]
pairs(t2, lower.panel=panel.smooth, upper.panel=panel.cor)
The following code defines the functions needed by pairs. You have to execute it
prior to the use of pairs.
panel.cor <- function(x, y, digits=2, prefix="", cex.cor)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits=digits)[1]
txt <- paste(prefix, txt, sep="")
cex <- if (missing(cex.cor)) 0.8/strwidth(txt) else cex.cor
test <- cor.test(x,y)
# borrowed from printCoefmat
Signif <- symnum(test$p.value, corr = FALSE, na = FALSE,
cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
symbols = c("***", "**", "*", ".", " "))
text(0.5, 0.5, txt, cex = cex * r)
text(.8, .8, Signif, cex=cex, col=2)
}
4.9 3d Images
Plotting 3d images is no problem if you already have a grid with regular spacing. The
procedure here also works with irregularly spaced data, but keep in mind that the spatial
interpolation may cause two kinds of problems:
- valleys and/or mountains may appear in the image which are not present in the data
- information at a smaller scale than the grid size may completely disappear in the image
For the spatial interpolation we use the package akima
install.packages("akima")
library(akima)
The data set is not a real spatial data set but a time series of soil water content at different
depth. The third dimension here is time.
g <- read.csv("soil_water.csv", header=TRUE)
attach(g)
Define the range of values:
x0 <- -180:0
y0 <- 0:367
ak <- interp(g$Depth, g$Day, g$SWC, xo=x0, yo=y0)
The ranges can also be defined automatically by the functions:
x0 <- min(Depth):max(Depth)
y0 <- min(Day):max(Day)
The variable ak now contains a regular grid. Now we can plot all kinds of impressive
three-dimensional figures. We start with a simple contour plot:
contour(ak$x, ak$y, ak$z)
A more colourful version codes the values of the z-column with all colours of the rainbow:
image(ak$x, ak$y, ak$z, col=rainbow(50))
A similar picture comes out from
filled.contour(ak$x, ak$y, ak$z, col=rainbow(50))
If you don't like the colours of the rainbow, you can also use the palettes heat.colors,
topo.colors and terrain.colors, or build your own with hsv().
A three-dimensional view is created by:
persp(ak$x, ak$y, ak$z, expand=0.25, theta=60, phi=30, xlab="Depth",
ylab="Day", zlab="SWC", ticktype="detailed", col="lightblue")
5 Bivariate Statistics
This chapter is based on the following literature sources:
Kabacoff, R.I., 2011. R in Action - Data Analysis and Graphics with R. Manning Publications
Co., Shelter Island, NY.
http://www.manning.com/kabacoff/
Logan, M., 2010. Biostatistical Design and Analysis Using R: A Practical Guide. Wiley-
Blackwell, Chichester, West Sussex.
http://eu.wiley.com/WileyCDA/WileyTitle/productCd-1405190086.html
Trauth, M.H., 2006. MATLAB recipes for earth sciences. Springer, Berlin Heidelberg New
York.
http://www.springer.com/earth+sciences+and+geography/book/978-3-642-12761-8?
changeHeader
Bivariate analysis aims to understand the relationship between two variables x and y.
Examples are
When the two variables are measured on the same object, x is usually identified as the
independent variable, whereas y is the dependent variable. If both variables were generated
in an experiment, the variable manipulated by the experimenter is described as the
independent variable. In some cases, both variables are not manipulated and therefore
independent. The methods of bivariate statistics help describe the strength of the
relationship between the two variables, either by a single parameter such as Pearson's
correlation coefficient for linear relationships or by an equation obtained by regression
analysis (Fig. 5-1). The equation describing the relationship between x and y can be used to
predict the y-response for arbitrary x values within the range of the original data values used for
regression. This is of particular importance if one of the two parameters is difficult to
measure. Here, the relationship between the two variables is first determined by regression
analysis on a small training set of data. Then, the regression equation is used to calculate
this parameter from the first variable.
Correlation or Regression ?!
Correlation: Neither variable has been set (they are both measured) AND there is no implied
causality between the variables
Regression: Either one of the variables has been specifically set (not measured) OR there is an
implied causality between the variables, whereby one variable could influence the other but the
reverse is unlikely.
The thirty data points represent the age of a sediment (in kiloyears before present) in a certain depth (in
meters) below the sediment-water interface. The joint distribution of the two variables suggests a linear
relationship between age and depth, i.e., the increase of the sediment age with depth is constant. Pearson's
correlation coefficient (explained in the text) of r = 0.96 supports the strong linear dependency of the two
variables. Linear regression yields the equation age = 6.6 + 5.1 · depth. This equation indicates an increase of the
sediment age of 5.1 kyrs per meter of sediment depth (the slope of the regression line). The inverse of the slope
is the sedimentation rate of ca. 0.2 meters/kyr. Furthermore, the equation gives the age of the sediment
surface as 6.6 kyrs (the intercept of the regression line with the y-axis). The deviation of the surface age from
zero can be attributed either to the statistical uncertainty of the regression or to a natural process such as
erosion or bioturbation. Whereas the assessment of the statistical uncertainty will be discussed in this
chapter, the latter needs a careful evaluation of the various processes at the sediment-water interface.
5.1 Pearson's Correlation Coefficient
Correlation coefficients are often used at the exploration stage of bivariate statistics. They are
only a very rough estimate of a (recti-)linear trend in the bivariate data set. Unfortunately,
the literature is full of examples where the importance of correlation coefficients is
overestimated and outliers in the data set lead to an extremely biased estimator of the
population correlation coefficient. The most popular correlation coefficient is Pearson's
linear product-moment correlation coefficient (Fig. 5-1). We estimate the population's
correlation coefficient from the sample data, i.e., we compute the sample correlation
coefficient r, which is defined as

r = Σ (xᵢ − x̄)(yᵢ − ȳ) / ((n − 1) · sx · sy)
where n is the number of xy pairs of data points, sx and sy are the univariate standard
deviations. The numerator of Pearson's correlation coefficient is known as the corrected sum
of products of the bivariate data set. Dividing the numerator by (n − 1) yields the covariance,
which is the summed products of deviations of the data from the sample means, divided by
(n − 1). The covariance is a widely used measure in bivariate statistics, although it has the
disadvantage of depending on the dimension of the data.
Dividing the covariance by the univariate standard deviations removes this effect and leads
to Pearson's correlation coefficient r.
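The chain from corrected sum of products to covariance to r can be verified on synthetic data (all values invented):

```r
# Sketch: Pearson's r assembled from its building blocks
set.seed(42)
x <- 1:30
y <- 2 * x + rnorm(30, sd = 5)
n <- length(x)
sp    <- sum((x - mean(x)) * (y - mean(y)))  # corrected sum of products
covxy <- sp / (n - 1)                        # covariance
r     <- covxy / (sd(x) * sd(y))             # Pearson's r
all.equal(r, cor(x, y))                      # agrees with the built-in cor()
```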
The dataset:
The synthetic data consist of two variables, the age of a sediment in kiloyears before
present and the depth below the sediment-water interface in meters. The use of synthetic
data sets has the advantage that we fully understand the linear model behind the data.
The data are represented as two columns contained in file agedepth.txt. These data have
been generated using a series of thirty random levels (in meters) below the sediment
surface. The linear relationship age = 5.6 · meters + 1.2 was used to compute noise-free values
for the variable age. This is the equation of a straight line with a slope of 5.6 and an
intercept with the y-axis of 1.2. Finally, some Gaussian noise of amplitude 10 was added to
the age data.
We load the data from the file agedepth.txt using the import function of RStudio
(separator: white space, decimal: ".")
The value of r = 0.9342 suggests that the two variables age and depth depend on each other.
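The outlier demonstration below operates on a random, uncorrelated bivariate data set of 30 points; the source does not show its construction, so a minimal sketch (seed and parameters invented, variable names x and y as in the following commands) might look like this:

```r
# Invented construction of 30 random, uncorrelated (x, y) points
set.seed(1)
x <- rnorm(30)
y <- rnorm(30)
plot(x, y)
cor.test(x, y)  # r should be close to zero for independent data
```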
cor.test(~y + x)
x[31]=5
y[31]=5
plot(x,y)
cor.test(~y + x)
abline(lm(y ~ x), col="red")
After increasing the absolute (x,y) values of this outlier, the correlation coefficient increases
dramatically.
x[31]=10
y[31]=10
plot(x,y)
cor.test(~y + x)
abline(lm(y ~ x), col="red")
Still, the bivariate data set does not provide much evidence for a strong dependence.
However, the combination of the random bivariate (x,y) data with one single outlier results
in a dramatic increase of the correlation coefficient. Whereas outliers are easy to identify in
a bivariate scatter, erroneous values might be overlooked in large multivariate data sets.
abline(lm(y ~ x)) uses the basic graphics function abline() to add a regression trend line
based on the linear model (lm) of y on x.
The dataset:
Sokal and Rohlf (1997) present an unpublished data set (L. Miller) in which the correlation
between gill weight and body weight of the crab (Pachygrapsus crassipes) is investigated.
Exercise 19:
a) import the crab data set (crab.csv, separator ,)
b) assess linearity and bivariate normality using a scatterplot with marginal boxplots
c) Calculate Pearson's correlation coefficient and test H0: ρ = 0 (that the population
correlation coefficient equals zero)
5.2 Correlograms
The dataset:
It's easier to explain a correlogram once you've seen one. Consider the correlations among
the variables in the PMM data set. Here you have 15 variables, namely different chemical
elements measured during an XRF (X-ray fluorescence) scan of a sediment core taken from
an Alpine peat bog (Plan da Mattun Moor, PMM) in 2010. The core had a length of 143 cm
and was scanned at 1-cm resolution. Hence, 143 data values are available for each element.
We load the data from the file PMM.txt using the import function of RStudio (separator
white space)
You can get the correlations using the following code:
options(digits=2)
cor(PMM)
Al Si S Cl K Ca Ti Mn Fe Zn Br Rb
Al 1.0000 0.9073 -0.1005 -0.0740 0.9167 -0.2974 0.9176 0.1688 0.4443 0.4805 -0.3899 0.7344
Si 0.9073 1.0000 -0.2246 -0.2298 0.7381 -0.5379 0.8063 0.1783 0.2953 0.2885 -0.5806 0.4900
S -0.1005 -0.2246 1.0000 0.1782 -0.1026 0.5458 -0.1194 0.0003 0.5452 -0.0443 0.1718 -0.0904
Cl -0.0740 -0.2298 0.1782 1.0000 0.0828 0.5761 0.0331 0.3657 0.1702 0.3451 0.4423 0.2326
K 0.9167 0.7381 -0.1026 0.0828 1.0000 -0.1509 0.9482 0.1429 0.4794 0.6092 -0.2269 0.8973
Ca -0.2974 -0.5379 0.5458 0.5761 -0.1509 1.0000 -0.3021 0.0238 0.0442 0.0480 0.6203 0.0523
Ti 0.9176 0.8063 -0.1194 0.0331 0.9482 -0.3021 1.0000 0.1917 0.5605 0.5442 -0.3700 0.8425
Mn 0.1688 0.1783 0.0003 0.3657 0.1429 0.0238 0.1917 1.0000 0.2784 0.4044 0.0644 0.1225
Fe 0.4443 0.2953 0.5452 0.1702 0.4794 0.0442 0.5605 0.2784 1.0000 0.3898 -0.0458 0.4875
Zn 0.4805 0.2885 -0.0443 0.3451 0.6092 0.0480 0.5442 0.4044 0.3898 1.0000 0.1412 0.6466
Br -0.3899 -0.5806 0.1718 0.4423 -0.2269 0.6203 -0.3700 0.0644 -0.0458 0.1412 1.0000 -0.0314
Rb 0.7344 0.4900 -0.0904 0.2326 0.8973 0.0523 0.8425 0.1225 0.4875 0.6466 -0.0314 1.0000
Which variables are most related?
Which variables are relatively independent?
Are there any patterns?
It isn't that easy to tell from the correlation matrix without significant time and effort (and
probably a set of colored pens to make notations). You can display the same correlation
matrix using the corrgram() function in the corrgram package.
library(corrgram)
corrgram(PMM)
Figure 5-3: Correlogram of the correlations among the variables in the PMM data frame
To interpret this graph (Fig. 5-3), start with the lower triangle of cells (the cells below the
principal diagonal). By default, a blue color and hashing that goes from lower left to upper
right represents a positive correlation between the two variables that meet at that cell.
Conversely, a red color and hashing that goes from the upper left to the lower right
represents a negative correlation. The darker and more saturated the color, the greater the
magnitude of the correlation. Weak correlations, near zero, will appear washed out.
The general format of the function call is corrgram(x, order=, panel=, text.panel=,
diag.panel=), where x is a data frame with one observation per row. When order=TRUE, the variables are
reordered using a principal component analysis of the correlation matrix. Reordering can
help make patterns of bivariate relationships more obvious. The option panel specifies the
type of off-diagonal panels to use. Alternatively, you can use the options lower.panel and
upper.panel to choose different options below and above the main diagonal. The text.panel
and diag.panel options refer to the main diagonal.
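A sketch of a full corrgram() call on a built-in data set (mtcars stands in for the PMM data frame here; panel.shade, panel.pie and panel.txt are panel functions supplied by the corrgram package). The call is guarded so it is skipped if the package is not installed:

```r
# Hypothetical corrgram() call with reordering and mixed panels
if (requireNamespace("corrgram", quietly = TRUE)) {
  library(corrgram)
  corrgram(mtcars[, 1:6], order = TRUE,
           lower.panel = panel.shade, upper.panel = panel.pie,
           text.panel = panel.txt)
}
```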
Figure 5-4: Different correlogram layouts using the corrgram package with the variables in the PMM data frame
The dataset:
Equivalent to the PMM data set, the STY1 data set consists of geochemical data produced
by XRF scanning of a sediment core from Lake Stymphalia in Greece (Unkel et al., 2011).
Reference:
Unkel, I., Heymann, C., Nelle, O., Zagana, H., 2011. Climatic influence on Lake Stymphalia
during the last 15 000 years, In: Lambrakis, N., Stournaras, G., Katsanou, K. (Eds.), Advances
in the Research of Aquatic Environment. Springer, Berlin, Heidelberg, pp. 75-82.
Exercise 20:
a) import the STY1 data set (STY1.txt, separator=Tab)
b) plot the following element-combinations in a nested (multi-plot) figure of 4 plots,
add a title (main) and a regression line (abline) and different color in each
respective plot:
Al-Si; Ca-Sr; Ca-Si; and Mn-Fe
explain what you see.
c) Calculate Pearson's correlation coefficient and test H0: ρ = 0 (that the population
correlation coefficient equals zero) for these four element-combinations
d) produce first an unsorted and then a sorted (order=TRUE) correlation matrix of the
entire STY1 data set, both times displaying only the lower panel as shades.
5.3 Classical Linear Regression
Linear regression provides another way of describing the dependence between the two
variables x and y. Whereas Pearson's correlation coefficient provides only a rough measure
of a linear trend, linear models obtained by regression analysis allow us to predict arbitrary y
values for any given value of x within the data range. Statistical testing of the significance of
the linear model provides some insight into the quality of the prediction. Classical regression
assumes that y responds to x, and that the entire dispersion in the data set is in the y-values (Fig.
5-5). Then, x is the independent, regressor or predictor variable. The values of x are defined
by the experimenter and are often regarded as free of errors. An example is the
location x of a sample in a sediment core. The dependent variable y contains errors, as its
magnitude cannot be determined accurately. Linear regression minimizes the y deviations
between the xy data points and the values predicted by the best-fit line using a least-squares
criterion. The basic equation for a general linear model is

y = b0 + b1 · x

where b1 is the slope and b0 the intercept of the regression line.
The regression line passes through the data centroid defined by the sample means. We can
therefore compute the other regression coefficient b0 as

b0 = ȳ − b1 · x̄

using the univariate sample means and the slope b1 computed earlier.
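The centroid relationship can be checked numerically: on synthetic data (all values invented) the hand-computed slope b1 = cov(x, y)/var(x) and intercept b0 reproduce the coefficients returned by lm():

```r
# Sketch: least-squares coefficients from sample means and covariance
set.seed(7)
x <- runif(30, 0, 20)
y <- 1.2 + 5.6 * x + rnorm(30, sd = 10)
b1 <- cov(x, y) / var(x)      # slope
b0 <- mean(y) - b1 * mean(x)  # intercept via the data centroid
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b0, b1))  # matches lm()
```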
Exercise 21:
a) import the Nelson data set (nelson.csv, separator=, )
b) assess linearity and bivariate normality using a scatterplot with marginal boxplots
comment: the ordinary least squares method is considered appropriate, as there is
effectively no uncertainty (error) in the predictor variable (x-values, relative
humidity)
c) fit the simple linear regression model (y = b0 + b1·x) and examine the diagnostics:
nelson.lm <- lm(WEIGHTLOSS~HUMIDITY, nelson)
plot(nelson.lm)
Table from Logan (2011), figure from Trauth (2006)
mean: The most popular indicator of central tendency is the arithmetic mean, which is
the sum of all data points divided by the number of observations
median: the median is often used as an alternative measure of central tendency. The
median is the x-value which is in the middle of the data, i.e., 50% of the
observations are larger than the median and 50% are smaller. For a data set
sorted in ascending order it is defined as

x̃ = x[(N+1)/2] if N is odd, and x̃ = ( x[N/2] + x[N/2+1] ) / 2 if N is even
Quantiles are a more general way of dividing the data sample into groups containing equal
numbers of observations. For example, quartiles divide the data into four groups,
quintiles divide the observations into five groups and percentiles define one hundred
groups.
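In R, quantile() computes these directly; a minimal sketch with no external data:

```r
# Sketch: quartiles and the median with quantile()
x <- 1:100
quantile(x, probs = c(0.25, 0.5, 0.75))  # the three quartiles
median(x)                                # equals the 50% quantile here
```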
degrees of freedom : the number of values in a distribution that are free to be varied.
Null hypothesis:
A biological or research hypothesis is a concise statement about the predicted or theorized
nature of a population or populations and usually proposes that there is an effect of a
treatment (e.g. the means of two populations are different). Logically however, theories
(and thus hypotheses) cannot be proved, only disproved (falsification), and thus a null
hypothesis (H0) is formulated to represent all possibilities except the hypothesized
prediction. For example, if the hypothesis is that there is a difference between (or
relationship among) populations, then the null hypothesis is that there is no difference or
relationship (effect). Evidence against the null hypothesis thereby provides evidence that
the hypothesis is likely to be true. The next step in hypothesis testing is to decide on an
appropriate statistic that describes the nature of population estimates in the context of the
null hypothesis, taking into account the precision of the estimates. For example, if the
hypothesis is that the mean of one population is different from the mean of another
population, the null hypothesis is that the population means are equal. The null hypothesis
can therefore be represented mathematically as H0: μ1 = μ2, or equivalently H0: μ1 − μ2 = 0.
6 Univariate Statistics
6.1 Student's t Test
(Chapter text based on Trauth, 2006)
The Student's t distribution was first introduced by William Gosset (1876–1937), who needed a
distribution for small samples (Fig. 6-1, 26). Gosset was an employee of the Guinness
Brewery in Dublin and was not allowed to publish research results. For that reason he published his
t distribution under the pseudonym Student (Student, 1908). The probability density
function has a single parameter, the number of degrees of freedom ν. In the analysis of
univariate data, this parameter is ν = n − 1, where n is the sample size. As ν → ∞, the t
distribution converges to the standard normal distribution. Since the t distribution
approaches the normal distribution for ν > 30, it is not often used for distribution fitting.
However, the t distribution is used for hypothesis testing, namely the t-test.
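The convergence to the normal distribution is easy to see from the 97.5% quantiles (a sketch; no data required):

```r
# Sketch: t quantiles approach the standard normal quantile as df grows
qt(0.975, df = 5)    # widest: small sample
qt(0.975, df = 30)   # already close to the normal value
qnorm(0.975)         # standard normal reference
```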
The Student's t-test by Gosset compares the means of two distributions.
Let us assume that two independent sets of na and nb measurements have been carried
out on the same object. For instance, several samples were taken from two different
outcrops. The t-test can be used to test the hypothesis that both samples come from the
same population, e.g., the same lithologic unit (null hypothesis), or from two different
populations (alternative hypothesis). Both the sample and the population distributions have to be
Gaussian, and the variances of the two sets of measurements should be similar. Then, the
proper test statistic for the difference of two means is

t = (x̄a − x̄b) / sqrt( [((na − 1)·sa² + (nb − 1)·sb²) / (na + nb − 2)] · (1/na + 1/nb) )

where na and nb are the sample sizes and sa² and sb² are the variances of the two samples a
and b. The alternative hypothesis can be rejected if the measured t-value is lower than the critical
t-value, which depends on the degrees of freedom ν = na + nb − 2 and the significance level
α. In this case, we cannot reject the null hypothesis. The significance level α of a test is the
maximum probability of accidentally rejecting a true null hypothesis. Note that we cannot
prove the null hypothesis; in other words, "not guilty" is not the same as "innocent".
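The test statistic can be assembled by hand and compared with t.test(var.equal = TRUE) on synthetic samples (all values invented):

```r
# Sketch: pooled-variance t statistic by hand vs. t.test()
set.seed(3)
a <- rnorm(25, mean = 10, sd = 2)
b <- rnorm(20, mean = 11, sd = 2)
na <- length(a); nb <- length(b)
s2p  <- ((na - 1) * var(a) + (nb - 1) * var(b)) / (na + nb - 2)  # pooled variance
tval <- (mean(a) - mean(b)) / sqrt(s2p * (1 / na + 1 / nb))
all.equal(unname(t.test(a, b, var.equal = TRUE)$statistic), tval)
```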
Exercise 22:
a) import the Ward data set (ward.csv, separator=, )
b) We then assess the assumptions of normality and homogeneity of variance for the null
hypothesis that the population mean egg production is the same for both littorinid
and mussel zone Lepsiella:
boxplot(EGGS~ZONE, ward)
with(ward, rbind(MEAN=tapply(EGGS, ZONE, mean),
VAR=tapply(EGGS,ZONE,var)))
with(data, expr, )
is a generic function that evaluates an expression in an environment
constructed from data, possibly modifying the original data.
tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE)
Applies a function to each cell of a ragged array, that is to each
(non-empty) group of values given by a unique combination of the
levels of certain factors.
Conclusions 1
There is no evidence of non-normality (boxplots not grossly asymmetrical) or
unequal variance (boxplots very similar in size and variances very similar).
Hence the simple Student's t-test is likely to be reliable, and we can test the null
hypothesis as formulated above.
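The test itself is a pooled-variance t-test. A sketch with an invented stand-in data frame (the real call would use the ward data loaded in a), with its EGGS and ZONE columns):

```r
# Invented stand-in data with two zones and clearly different means
demo <- data.frame(
  EGGS = c(8, 9, 10, 9, 11, 12, 13, 12),
  ZONE = rep(c("Littorinid", "Mussel"), each = 4)
)
t.test(EGGS ~ ZONE, data = demo, var.equal = TRUE)  # pooled-variance t-test
```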
Conclusions 2
We reject the null hypothesis (i.e. egg production is not the same). Egg production
was significantly greater in the mussel zone than in the littorinid zone.
6.2 Welch's t Test
Exercise 23:
a) import the furness data set (furness.csv, separator=, )
b) We then assess the assumptions of normality and homogeneity of variance for the null
hypothesis that the population mean metabolic rate is the same for male and
female fulmars.
boxplot(METRATE~SEX, furness)
with(furness, rbind(MEAN=tapply(METRATE, SEX, mean),
VAR=tapply(METRATE, SEX,var)))
Conclusions 1
Whilst there is no evidence of non-normality (boxplots not grossly
asymmetrical), the variances are a little unequal (one of the boxplots is almost
three times smaller than the other). Hence, a separate-variances t-test
(Welch's test) is more appropriate than a pooled-variances t-test (Student's test).
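In R, t.test() performs Welch's separate-variances test by default; a sketch on invented stand-in data (the real call would use the furness data with its METRATE and SEX columns):

```r
# Invented stand-in data with unequal group variances
demo <- data.frame(
  METRATE = c(1.2, 1.4, 1.1, 1.3, 2.0, 3.5, 1.0, 4.2),
  SEX     = rep(c("Female", "Male"), each = 4)
)
t.test(METRATE ~ SEX, data = demo)                    # Welch (default)
t.test(METRATE ~ SEX, data = demo, var.equal = TRUE)  # pooled, for comparison
```

Note that the Welch version reports fractional degrees of freedom, always fewer than the pooled na + nb − 2.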
6.3 F-Test
(Chapter based on Trauth, 2006)
The F distribution was named after the statistician Sir Ronald Fisher (1890–1962). It is used for
hypothesis testing, namely for the F-test. The F distribution has a relatively complex probability
density function.
The F-test by Snedecor and Cochran (1989) compares the variances sa² and sb² of two distributions,
where sa² > sb². An example is the comparison of the natural heterogeneity of two samples based on
replicated measurements. The sample sizes na and nb should be above 30. Then, the proper test
statistic to compare the variances is

F = sa² / sb²

The two variances are not significantly different, i.e., we reject the alternative hypothesis, if the
measured F-value is lower than the critical F-value, which depends on the degrees of freedom νa =
na − 1 and νb = nb − 1, respectively, and on the significance level α.
Exercise 24:
a) perform an F-test for the Ward data set (EGGS, ZONE)
b) perform an F-test for the Furness data set (METRATE, SEX)
c) create the following artificial data set and perform an F-test:
x <- rnorm(50, mean = 0, sd = 1)
y <- rnorm(50, mean = 1, sd = 1)
var.test(x, y)
now vary the standard deviation (sd) and the number of data
points and describe what you see.
The 2-test introduced by Karl Pearson (1900) involves the comparison of distributions, permitting
a test that two distributions were derived from the same population. This test is independent of the
distribution that is being used. Therefore, it can be applied to test the hypothesis that the
observations were drawn from a specific theoretical distribution. Let us assume that we have a data
set that consists of 100 chemical measurements from a sandstone unit. We could use the 2-test to
test the hypothesis that these measurements can be described by a Gaussian distribution with a typical
central value and a random dispersion around. The n data are grouped in K classes, where n should
be above 30. The frequencies within the classes Ok should not be lower than four and never be zero.
Then, the proper test statistic is
χ² = Σk (Ok - Ek)² / Ek
where Ek are the frequencies expected from the theoretical distribution. The alternative hypothesis
is that the two distributions are different. It can be rejected if the measured χ² is lower than the
critical χ², which depends on the degrees of freedom Φ = K - Z, where K is the number of classes and
Z is the number of parameters describing the theoretical distribution plus the number of variables
(for instance, Z = 2 + 1 for the mean and the variance of a Gaussian distribution of a data set of one
variable, Z = 1 + 1 for a Poisson distribution of one variable).
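A minimal numeric illustration of the statistic and its critical value, using hypothetical counts and a uniform expectation:

```r
# Hypothetical example: 60 observations in K = 4 classes, expected uniform
Ok <- c(18, 12, 16, 14)           # observed frequencies
Ek <- rep(15, 4)                  # expected frequencies
chi2 <- sum((Ok - Ek)^2 / Ek)     # the test statistic defined above
chi2                              # 1.33
qchisq(0.95, df = 4 - 1)          # critical value (7.81) for alpha = 0.05
# chi2 is below the critical value: the uniform hypothesis cannot be rejected
```
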
By comparing any given sample chi-square statistic to its appropriate χ² distribution, the
probability that the observed category frequencies could have been collected from a population with a
specific ratio of frequencies (for example 3:1) can be estimated. As is the case for most hypothesis
tests, probabilities lower than 0.05 (5%) are considered unlikely and suggest that the sample is
unlikely to have come from a population characterized by the null hypothesis. Chi-squared tests are
typically one-tailed tests focusing on the right-hand tail, as we are primarily interested in the
probability of obtaining large chi-square values. Nevertheless, it is also possible to focus on the left-
hand tail so as to investigate whether the observed values are "too good to be true".
The χ²-distribution takes into account the expected natural variability in a population as well as the
nature of sampling (in which multiple samples should yield slightly different results). The more
categories there are, the more likely it is that the observed and expected values will differ. It could be
argued that when there are a large number of categories, samples in which all the observed
frequencies are very close to the expected frequencies are a little suspicious and may represent
dishonesty on the part of the researcher.
Zar (1999) presented a data set that depicted the classification of 250 plants into one of four
categories on the basis of seed type (yellow smooth, yellow wrinkled, green smooth, and green
wrinkled). Zar used these data to test the null hypothesis that the samples came from a population
that had a 9:3:3:1 ratio of these seed types.
First, we create a data frame with the Zar (1999) seed data.
We should convert the seeds data frame into a table. Whilst this step is not strictly necessary, it
ensures that columns in various tabular outputs have meaningful names:
We assess the assumption of sufficient sample size (<20% of expected values <5) for the specific null
hypothesis.
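A sketch of these steps; the counts used here are placeholders summing to 250 (the actual values are in the Zar data set):

```r
# Hypothetical seed counts (placeholders), in the order
# yellow smooth, yellow wrinkled, green smooth, green wrinkled
seeds <- data.frame(type  = factor(c("YS", "YW", "GS", "GW"),
                                   levels = c("YS", "YW", "GS", "GW")),
                    count = c(152, 39, 53, 6))
seeds.xtab <- xtabs(count ~ type, data = seeds)   # table with named columns
# assumption check: expected frequencies under the 9:3:3:1 null hypothesis
expected <- sum(seeds$count) * c(9, 3, 3, 1) / 16
expected                                          # all values are > 5
```

Fixing the factor levels keeps the table in the same order as the probability vector passed to chisq.test() later.
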
Conclusion 1: all expected values are greater than 5, therefore the chi-squared statistic is likely to
be a reliable approximation of the χ² distribution.
Now, we test the null hypothesis that the samples could have come from a population with a 9:3:3:1
seed type ratio.
chisq.test(seeds.xtab, p=c(9/16, 3/16, 3/16, 1/16), correct=FALSE)
Conclusion 2: reject the null hypothesis, because the probability is lower than 0.05; the samples are
unlikely to have come from a population with a 9:3:3:1 ratio.
1. to develop a better predictive model (equation) than is possible from models based on
single independent variables
2. to investigate the relative individual effects of each of the multiple independent variables
above and beyond the effects of the other variables.
library(car)
# newer versions of car call this function scatterplotMatrix()
scatterplot.matrix(~Ca+Ti+K+Rb+Sr+Mn+Fe, data=example2, diag="boxplot")
Conclusion 1: the element Mn is obviously non-normally distributed (asymmetrical boxplot). Let us
try out how a scale transformation (e.g. a logarithm) changes that:
scatterplot.matrix(~Ca+Ti+K+Rb+Sr+log10(Mn)+Fe, data=example2,
diag="boxplot")
7.2 Cu
It has become apparent from our previous analysis that a linear regression model provides a good
way of describing the scaling properties of the data. However, we may wish to check whether the
data could be equally-well described by a polynomial fit of a higher degree.
Sokal and Rohlf (1997) present an unpublished data set in which the nature of the relationship
between Lap94 allele (=group of genes) frequency in Mytilus edulis (blue mussel) and distance (in
miles) from Southport was investigated.
We import the mytilus data set using the import function of Rcmdr (mytilus.csv, separator=,)
Sokal and Rohlf (1997) transformed frequencies using angular transformations (arcsin
transformations). Hence, we also have to transform the Lap94 data using
asin(sqrt(LAP))*180/pi
We then have to show that a simple linear regression does not adequately describe the relationship
between Lap94 and distance by examining a scatterplot and a residual plot.
scatterplot
scatterplot(asin(sqrt(LAP))*180/pi ~DIST, data=mytilus)
residual plot:
plot(lm(asin(sqrt(LAP))*180/pi ~DIST, data=mytilus), which=1)
We try to fit a polynomial regression (additive multiple regression) model incorporating up to the
fifth power (5th order polynomial)
Note that trends beyond a third order polynomial are unlikely to have much
biological basis and are likely to be over-fit. This is also true for most
geoscientific applications.
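The call that creates mytilus.lm5 is not shown above; a minimal sketch, assuming the mytilus data frame (columns LAP and DIST) has been imported as described:

```r
# 5th-order polynomial fit; I() keeps ^ as arithmetic inside the formula
mytilus.lm5 <- lm(asin(sqrt(LAP)) * 180/pi ~ DIST + I(DIST^2) + I(DIST^3) +
                    I(DIST^4) + I(DIST^5), data = mytilus)
summary(mytilus.lm5)
# poly(DIST, 5, raw = TRUE) would be an equivalent, shorter formulation
```
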
Coefficients:
(Intercept)        DIST   I(DIST^2)   I(DIST^3)   I(DIST^4)   I(DIST^5)
  2.224e+01   1.049e+00  -1.517e-01   6.556e-03  -1.033e-04   5.518e-07
plot(mytilus.lm5, which=1)
Conclusion 2: no wedge pattern of the residuals (see figure XX in chapter 8.3.1), suggesting
homogeneity of variance and that the fitted model is appropriate.
Now, we want to examine the fit of the model with respect to the contribution of the different
powers:
anova(mytilus.lm5)
What was already stated above is here put into numbers: powers of distance beyond a cubic (third
order, x³) do not make significant contributions to explaining the variation of this data set.
For evaluating the contribution of an additional power (order) we can compare the fit of higher
order models against models one lower in order.
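A sketch of such stepwise comparisons, assuming mytilus is loaded; the lower-order models follow the same pattern as mytilus.lm5:

```r
# nested polynomial models, compared pairwise with anova()
mytilus.lm1 <- lm(asin(sqrt(LAP)) * 180/pi ~ DIST, data = mytilus)
mytilus.lm2 <- update(mytilus.lm1, . ~ . + I(DIST^2))
mytilus.lm3 <- update(mytilus.lm2, . ~ . + I(DIST^3))
anova(mytilus.lm1, mytilus.lm2)   # does the quadratic term help?
anova(mytilus.lm2, mytilus.lm3)   # does the cubic term help?
```
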
Conclusion 3: the third order model (lm3) fits the data significantly better than a second order
model (lm2) (P=0.018), while the second order model is not really better than a linear model (lm1)
(P=0.087).
Hence, we focus on the third order model and estimate the model parameters from the summary:
summary (mytilus.lm3)
Call:
lm(formula = asin(sqrt(LAP)) * 180/pi ~ DIST + I(DIST^2) + I(DIST^3),
data = mytilus)
Residuals:
Min 1Q Median 3Q Max
-6.1661 -2.1360 -0.3908 1.9016 6.0079
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.2232524 3.4126910 7.684 3.47e-06 ***
DIST -0.9440845 0.4220118 -2.237 0.04343 *
I(DIST^2) 0.0421452 0.0138001 3.054 0.00923 **
I(DIST^3) -0.0003502 0.0001299 -2.697 0.01830 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Coefficient of determination:
r² is equal to the square of the correlation coefficient only within simple
linear regression.
r² = SSreg / SStot reflects the explained variance.
Conclusion 4: there was a significant cubic (third order) relationship between the frequency of the
Lap94 allele and the distance from Southport. The final equation of the regression is:
arcsin(sqrt(LAP))·180/π = 26.2233 - 0.9441·DIST + 0.0421·DIST² - 0.0004·DIST³
The data set consists of synthetic data resembling the barium content (in wt.%) down a sediment
core (in meters).
Objects and their features can be written as a data matrix, one row per object:
x = (x1, x2, ..., xn)
y = (y1, y2, ..., yn)
z = (z1, z2, ..., zn)
...
So the Jaccard distance in the above example would be
dJaccard(A, B) = (2 + 5) / 10 = 7/10
meaning that 70% of the features occur in only one of the objects.
In R we can use the dist() function
DISTANCEMATRIX=dist(DATASET, method="")
to compute the distances between several objects. dist() calculates the distance between
the rows of a matrix, so make sure your DATASET has the right format. The set of
comparison results you get back is called distance matrix. method="binary" gives you the
binary (Jaccard) distance. In R the input vectors are regarded as binary bits, so all non-zero
elements are on and zero elements are off. In these terms the distance can be seen as the
proportion of bits in which only one is on amongst those in which at least one is on, which
is an equivalent definition to the one given above.
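A quick check of the worked example above: two hypothetical binary vectors with 3 shared features, 2 only in A and 5 only in B reproduce the Jaccard distance of 7/10:

```r
# Two hypothetical objects: 3 shared features, 2 only in A, 5 only in B
A <- c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)
B <- c(1, 1, 1, 0, 0, 1, 1, 1, 1, 1)
dist(rbind(A, B), method = "binary")   # 0.7, i.e. 7 of 10 features differ
```
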
Another important similarity measure often used in ecology is the Bray-Curtis dissimilarity. It
compares species counts at two sites by summing up the absolute differences between the
counts for each species at the two sites and dividing this by the sum of the total abundances
in the two samples. The general formula for calculating the Bray-Curtis dissimilarity
between samples A and B is as follows, supposing that the counts for species x are denoted
by nAx and nBx:
dBrayCurtis(A, B) = Σx=1..m |nAx - nBx| / Σx=1..m (nAx + nBx)
This measure takes on values between 0 (samples identical: nAx = nBx for all x) and 1 (samples
completely disjoint).
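The formula is short enough to evaluate by hand in R; here with two hypothetical sites and three species (vegan's vegdist() gives the same value):

```r
# Bray-Curtis by hand for two hypothetical sites and three species
s1 <- c(6, 0, 2)
s2 <- c(3, 1, 4)
sum(abs(s1 - s2)) / sum(s1 + s2)   # 6/16 = 0.375
# vegan::vegdist(rbind(s1, s2), method = "bray") returns the same value
```
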
Exercise 27: You want to compare how similar two aquariums are. Calculate
Jaccard distance and Bray-Curtis dissimilarity with the formulas given above
and the data given below.
Number of individuals in:
Aquarium 1: 3 2 4 6
Aquarium 2: 6 0 0 11
If we have quantitative data the two most common distances used are Euclidean distance
and Manhattan (city-block) distance. Let us look at an example: We have analyzed two
different rocks for their content in calcium and silicium and we find the following
R1 = (calcium, silicium) = (r11, r12) = (0, 1) a.u. and R2 = (calcium, silicium) = (r21, r22) = (2, 2) a.u. (a.u. = arbitrary units).
If we plot calcium against silicium (Figure 28) we can see two points which represent the
two different rocks. How different are they? A very intuitive way to think of the distance
between these two points is the direct connection between them (continuous line). This
distance is called the Euclidean distance and can easily be calculated with the Pythagorean
theorem: d(R1, R2) = sqrt((r11 - r21)² + (r12 - r22)²) = sqrt(5), as you know from school. Another way
to calculate the distance is to follow the dotted lines, as if we walked around in
Manhattan. Then we get the Manhattan (city-block) distance:
d(R1, R2) = |r11 - r21| + |r12 - r22| = 3. Obviously, the obtained distances are not the same; the
distance between two objects very much depends on how you measure it.
Euclidean: d(x, y) = sqrt( Σk=1..n (xk - yk)² ). If n = 2 we have the case we saw in the example before.
Manhattan: d(x, y) = Σk=1..n |xk - yk|. If n = 2 we have the case we saw in the example before.
Exercise 28: Calculate the Manhattan and Euclidean distance between the two
objects a = (1, 1, 2, 3) and b = (2, 2, 1, 0). One way to solve this is to
use R as a normal calculator, applying the formulas above (or do it in your
head). The second one is to create two vectors a = c(a1, a2, a3, a4) and b and
use rbind(Vector1, Vector2) to combine them to a matrix. Then you can use
the dist() function.
A simple algorithm for agglomerative hierarchical clustering could look like this:
1. Initially all objects you want to cluster are alone in their cluster
2. Calculate the distances between all clusters using your linkage method
3. Join the closest two clusters
4. Go back to step 2 until all objects are in one comprehensive cluster
8.2.3 Clustering in R
In R we can use the function hclust(DISTMATRIX, method="") of the stats package. With
DISTMATRIX being a distance matrix obtained by the dist() function and method="" being
one of the following: "single", "complete" or "ward". Further linkage methods exist but
are now of no concern for us, you can look them up under ?hclust.
Let's get started:
Import the dataset PMM.txt in R. The variables are measured in grossly different ranges.
This might result in faulty clustering (try it if you want). Therefore we want to standardize
the range of the data first. We can use decostand() of the package vegan.
PMMn=decostand(PMM,"range")
"range" sets the highest value of each variable to 1 and scales the others accordingly, so
the values are now fractions of the maximum.
Have a look at the dataframe. We are now interested in clustering the elements and not the
observations. The objects we want to cluster have to be in the rows of the dataframe.
Therefore we need to transpose our datamatrix (change rows and columns). This is done by
PMMnt=t(PMMn)
Now let us calculate the distances between the elements. Since these are measured on a
ratio scale it makes sense to use euclidean distances.
distm=dist(PMMnt, method="euclidean")
Now we use this distance matrix as input for our clustering. Let us first use single as
clustering method.
ahclust=hclust(distm, method="single")
A graphical output can easily be obtained by plot(ahclust). This gives you a dendrogram
in which we can see how closely the observations are related. The length of the branches shows
the similarity of the objects. You can see that a lot of elements are added to existing
clusters in a stepwise fashion, i.e. one after the other. This is a peculiarity of the single-
linkage method.
If you have a lot of objects, presenting the result as a dendrogram is not very pretty anymore.
It is more useful to know the assignment of each object to a certain cluster at a certain
number of clusters. For that we use the R function cutree(DATA,#CLUSTERS). The
result is a vector with as many components as we have objects, and each component tells
you the cluster for that object. If we want to divide our data into two groups we can
therefore use
ahclust_2g=cutree(ahclust,k=2)
to get the assignment of each object to one of these groups. You can type the variable name
ahclust_2g to get some idea about this assignment.
Exercise 29: Try out the other linkage methods. Are there significant
differences? Check by plotting all the dendrograms into one big graph (see
Chapter 4.4 Combined figures if you forgot how to do that).
BONUSPOINTS: Cluster the non-standardized dataset PMM. Do the results
make sense?
Exercise 31: The districts of the Baltic can be grouped by the composition of the
algae species. The accompanying map figure shows the different
sites. Cluster the sites in the dataset algae_presence.csv with
agglomerative hierarchical clustering and a linkage method of your choice.
Use the presence/absence of species for classification. What distance
measure should you use? Look at the dendrogram. Do the results make
sense?
Repeat the exercise after you did a Beals transformation (see the following
infobox or ?beals) of the data. What distance measure should you use? Do
your results make more sense?
Beals transformation:
Beals smoothing is a multivariate transformation specially designed for
species presence/absence community data containing noise and/or a lot
of zeros. This transformation replaces the observed values (i.e. 0 or 1)
of the target species by predictions of occurrence on the basis of its
co-occurrences with the other remaining species (values between 0 and
1). In many applications, the transformed values are used as input for
multivariate analyses.
In R Beals transformation can be performed with the beals() function of the
vegan package.
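A minimal sketch using the dune example data set that ships with vegan:

```r
library(vegan)               # provides beals() and the dune example data
data(dune)
dune.beals <- beals(dune)    # replace observed values by occurrence predictions
range(dune.beals)            # transformed values lie between 0 and 1
```
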
8.5 P
Cluster analysis cannot be regarded as an objective statistical method because:
The choice of similarity index is done by the user.
Each different linkage procedure gives different results.
The number of groups is chosen by the researcher.
Further reading:
Afifi, May, Clark (2012): Practical Multivariate Analysis, CRC Press. Chapter
16: Cluster Analysis
A good introduction which is well understandable but more in-depth than
in this script.
http://www.econ.upf.edu/~michael/stanford/maeb7.pdf
Explanation of hierarchical clustering with examples.
bio.umontreal.ca/legendre/reprints/DeCaceres_&_Legendre_2008.pdf
A discussion about Beals transformation
8.6 R code library for cluster analysis
Function / Arguments / Use

library(stats)
  stats contains many basic statistical tools.
library(vegan)
  vegan contains specific tools for ecologists.
dist(x, method="")
  x: a numeric matrix, data frame or "dist" object. method: the distance measure to be
  used; must be "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski".
  Use: calculates the distance between the rows of a matrix and returns a distance matrix.
decostand(x, method)
  x: community data in a matrix. method: the standardization method, e.g. "normalize",
  "standardize", "range"; see ?decostand for details.
  Use: standardization.
t(x)
  Use: transposes a matrix.
hclust(d, method="")
  d: a dissimilarity structure (distance matrix) as produced by dist. method: the
  agglomeration method to be used; one of "ward", "single", "complete", "average", or others.
  Use: agglomerative hierarchical clustering.
cutree(tree, k = , h = )
  tree: a tree as produced by hclust. k: desired number of groups. h: height where the tree
  should be cut. At least one of k or h must be specified; k overrides h if both are given.
  Use: cuts a tree created by hierarchical clustering at a certain height or cluster number.
kmeans(x, centers)
  x: your input; has to be a numeric matrix of data. centers: the number of clusters, say k.
  Use: k-means clustering.
beals(x)
  x: input community data frame or matrix. Further parameters and details can be looked
  up with ?beals.
  Use: performs a Beals transformation of the data.
9 Ordination
One of the most challenging aspects of multivariate data analysis is the sheer complexity of
the information. If you have a dataset with 100 variables, how do you make sense of all the
interrelationships present? The goal of ordination methods is to simplify complex datasets
by reducing the number of dimensions of the objects. Recall that in the cluster analysis part
we defined objects with features as vectors with components. These objects can be thought
of as points in an n-dimensional space, with the values of the respective components giving
you the coordinates on n different coordinate axes. Up to n=3 this is easily conceivable,
but it works in exactly the same way for n>3.
The easiest way of dimension reduction would be to only consider one variable, e.g. the first
component of each vector and discard the rest for your analysis. This is of course not very
reasonable because you will lose a lot of information. Therefore different ordination
techniques have been developed that minimize the distortion of such a dimension
reduction. We will focus on two of these methods: Principal Component Analysis (PCA) and
non-metric multidimensional scaling (NMDS).
We need to introduce two more definitions that are used in discussing the results of a PCA.
The first is component scores, sometimes called factor scores. Scores are the transformed
variable values corresponding to a particular data point i.e. its new coordinates. Loading is
the weight by which each original variable is multiplied to get the component score. The
loadings tell you about the contribution of each original variable to a PC, so a high loading
means the variable determines the PC to a large extent.
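The relation between scores and loadings can be checked directly; a sketch using the built-in USArrests data as a stand-in (any numeric data set behaves the same way):

```r
pca <- princomp(USArrests, cor = TRUE)
# scores = (centred and scaled data) multiplied by the loading matrix
manual <- scale(USArrests, center = pca$center, scale = pca$scale) %*% pca$loadings
all.equal(as.numeric(manual), as.numeric(pca$scores))   # TRUE
```
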
The exact mathematical reasoning and procedure of PCA shall be of no concern for us here.
We want to focus more on the application and interpretation of the results, so let's rather
get started in R.
Further reading:
http://yatani.jp/HCIstats/PCA
a simple explanation of PCA which also explains how to interpret the
results.
http://strata.uga.edu/software/pdf/pcaTutorial.pdf
well comprehensible, more advanced description
http://ordination.okstate.edu/overview.htm
PCA and other ordination techniques for ecologists.
9.1.2 PCA in R
There are several possibilities to perform a PCA in R. We use a basic function from the stats
package: princomp(DATASET, cor=TRUE). cor specifies if the PCA should use the covariance
matrix or a correlation matrix. As a rough rule, we use the correlation matrix if the scales of
the variables are unequal. This is a conscious choice of the researcher!
An alternative to princomp() is the command principal() from the psych
package.
Let us work again with a dataset we already know and love: PMM.txt. Load the dataset, and
then we can use
PMM_pca <- princomp(PMM, cor=TRUE)
to carry out a complete PCA and get 15 principal components, their loadings and the scores
of the data. The first step of a PCA would be to calculate a covariance or correlation matrix.
However, the function will calculate it for us and we can use our raw data as input.
A basic summary of our analysis can be obtained by
print(PMM_pca)
summary(PMM_pca)
To get an idea about the data it is common to plot the scores of the 1st PC against the scores
for the 2nd. You could simply use plot(PMM_pca$scores[,1:2]), but a nicer output can be achieved by:
plot(PMM_pca$scores[,c(1,2)], pch=20)
text(PMM_pca$scores[,1],PMM_pca$scores[,2])
abline(0,0); abline(v=0)
To get an overview, we can create a scatterplot matrix, for example like this:
pairs(PMM_pca$scores[,1:4], main="Scatterplot Matrix of the scores of
the first 4 PCs")
We will get a scatterplot matrix of all these components against each other.
Since we want to know which variables have the greatest influence on our data, we want to
have a look at the loadings of the PCs. One way to do this is to just type the variable name:
PMM_pca$loadings
A graphical representation can be obtained by:
barplot(PMM_pca$loadings[,1], ylim=c(-0.5,0.5),ylab="Loading",
xlab="Original Variables", main="Loadings for PC1")
which shows which elements have the highest influence on the first PC.
A very common display for PCA results with scores as points in a coordinate system (e.g.
first and second component) and the loadings as vectors in the same graph is called a biplot.
The biplot is easily obtained:
biplot(PMM_pca, choices=1:2); abline(0,0); abline(v=0)
choices selects the PCs to plot. It is quite useful in analysing and interpreting the results.
Finally, we can have a look at the scree plot, plotting eigenvalues or variances against the PC
number:
plot(PMM_pca, type="lines")
We see that most of the information is in the first component and that from the 6th onward there
is not much information left in the components. From the summary we know that the
first 6 components explain 90% of the variance. In the next part we will see methods for
determining which PCs are still useful for further analysis.
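The "how many PCs" question can also be answered numerically from the standard deviations stored in the PCA object; a sketch with the built-in USArrests data as a stand-in (PMM_pca works identically):

```r
pca  <- princomp(USArrests, cor = TRUE)
prop <- pca$sdev^2 / sum(pca$sdev^2)   # proportion of variance per PC
cumsum(prop)                           # cumulative share, as in summary(pca)
which(cumsum(prop) >= 0.9)[1]          # number of PCs covering 90% of variance
```
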
Exercise 32: Repeat the plots above, but this time looking at the
relationship between the 1st and the 3rd Principal Component.
The main task is to determine the sex of the crabs based on these five morphometric
variables. Ideally we would like to have one single variable that allows this classification.
Work through the following subtasks:
1. View your data. From univariate box-plots assess whether any individual
variable is sufficient for discriminating the sexes or species.
Possibility 1, old school: arrange your plots (10!) clearly in one figure using
plot(a,b). Possibility 2, fast: melt the data set, then
ggplot+facet_grid.
2. How significant is the difference?
Test for differences between the groups.
BONUSPOINTS: Create a scatterplot matrix of all measured variables
against each other, colour-coded according to the group each observation belongs
to. Alternative: GGally::ggpairs().
Does this help us in discriminating the groups?
3. Perform a PCA on the dataset. Since not all variables are on
the same scale we use the PCA with the correlation matrix.
4. Plot the scores of the first and second PC. Are the groups
distinguishable?
Create a plot of the scores of the first and third PC.
Create a scatterplot matrix of the scores of the first PCs. Which
ones distinguish the groups?
Can we determine the sex with a single
PC?
Hint:
col=australian.crabs$group
5. Look at the loadings of the PCs. What can you learn from
that?
Create a plot of the loadings. Which variables
have the highest influence?
Create a biplot for PC1 and PC2. Does it
give an idea of the group structure?
6. How many PCs are useful for further analysis?
7. BONUS: Your colleague measured a crab with the
variables FL=0.91, RW=0.62, CL=0.81, CW=0.86, BD=0.90, but forgot to
write down which group it belongs to.
Hint: predict()
8. BONUS: Perform a cluster analysis on the
same data.
9.1.4 Problems of PCA
Principal Component Analysis is not suited for all data. The main problem is that it assumes
a linear correlation between the observations and their variables. This is often not justified,
especially in ecology. Species, for example, often show a unimodal behavior towards
environmental factors. Other ordination techniques exist which might be more suitable.
One alternative is to use a higher order PCA, the polynomial PCA. Another possibility are
techniques that are summarized under the term multidimensional scaling (MDS), which will be
covered in the next part of this chapter.
9.2.1 Principle of NMDS
You start out with a matrix of data consisting of n rows of samples and p columns of
variables, such as taxa for ecological data. From this, a n x n distance matrix of all pairwise
distances among samples is calculated with an appropriate distance measure, such as
Euclidean distance, Manhattan distance or, most common in ecology, Bray-Curtis distance.
The NMDS ordination will be performed on this distance matrix. In NMDS, only the rank
order of entries in the distance matrix (not the actual dissimilarities) is assumed to contain
the significant information. Thus, the purpose of the non-metric MDS algorithm is to find a
configuration of points whose distances reflect as closely as possible the rank order of the
original data, meaning that the two objects farthest apart in the original data should also be
farthest apart after NMDS and so on.
Next, a desired number of m dimensions is chosen for the ordination. The MDS algorithm
begins by assigning an initial location to each item of the samples in these m dimensions.
This initial configuration can be entirely random, though the chances of reaching the
correct solution are enhanced if the configuration is derived from another ordination
method. Since the final ordination is partly dependent on this initial configuration, a
program performs several ordinations, each starting from a different random arrangement
of points, and then selects the ordination with the best fit, or applies other procedures in
order to avoid the problem of local minima.
Distances among samples in this starting configuration are calculated and then regressed
against (compared with) the original distance matrix. In a perfect ordination, all
ordinated distances would fall exactly on the regression, that is, they would match the rank
order of distances in the original distance matrix perfectly. The goodness of fit of the
regression is measured based on the sum of squared differences between ordination-based
distances and the distances predicted by the regression. This goodness of fit is called stress.
It can be seen as the mismatch between the rank order of distances in the data and the rank
order of distances in the ordination. The lower your stress value is, the better is your
ordination.
The configuration is then improved by moving the positions of samples in ordination space
by a small amount in the direction in which stress decreases most rapidly. The ordination
distance matrix is recalculated, the regression performed again, and the stress recalculated.
This entire procedure of nudging samples and recalculating stress is repeated until the
stress value seems to have reached a (perhaps local) minimum.
Further reading:
http://s
Excellent explanation of the method and the application, with accompanying text.
http://www.unesco.org/webworld/idams/advg
Presentation at a more technical level.
http://ordination.okstate.edu/overview.htm
NMDS and other ordination techniques for ecologists.
9.2.2 NMDS in R
The function we want to use in R is called metaMDS of the package vegan. In order to
perform NMDS we first need to calculate the distance between items. metaMDS is a smart
function and will take on this task for you as well using vegdist. However, if you want to
scale your data and calculate the distance using a different function, metaMDS also accepts
a distancematrix as an input.
The vegan package is intended for ecological data and therefore uses the Bray-Curtis
dissimilarity by default. For non-ecological data you may want to choose a different
distance measure for the ordination.
One alternative to metaMDS is the isoMDS()
function in the MASS package.
So our work in R is rather easy. We load our forest dataset by (watch out for the correct
directory path!)
forests<-read.csv("forests.csv", header=TRUE, row.names=1)
Since metaMDS is a complex function there are a lot of possible parameters. You will want
to check
?metaMDS
to see what possible parameters there are. The columns of the dataset should contain the
variables and the rows the samples. In our dataset this is the other way around, so we still
need to transpose it:
t_forests=t(forests)
Now a simple NMDS analysis of our dataset with the default settings could look like this:
def_nmds_for=metaMDS(t_forests)
We might wish to specify some parameters:
nmds_for=metaMDS(t_forests, distance = "euclidean", k = 3,
autotransform=FALSE)
distance is the distance measure used (see 8.1 Measures of distance), k is the number of
dimensions, and autotransform specifies whether automatic transformations are turned on or off.
You can see which objects the metaMDS function returns by
names(nmds_for)
the important ones are
nmds_for$points #sample/site scores
nmds_for$species #scores of variables (species / taxa in ecology)
nmds_for$stress #stress value of final solution
nmds_for$dims #number of MDS axes or dimensions
nmds_for$data #what was ordinated, including any transformations
nmds_for$distance #distance metric used
We can view which parameters were used by writing the output variable name:
nmds_for
Important for us are the sample and variable scores, which we can extract by
variableScores <- nmds_for$species
sampleScores <- nmds_for$points
The column numbers correspond to the MDS axes, so this will return as many columns as
were specified with the k parameter in the call to metaMDS.
We can obtain a plot by:
plot(nmds_for)
Sites/samples are shown by black circles, the taxa by red crosses.
MDS plots can be customized by selecting either "sites" or "species" in display=, by
displaying labels instead of symbols by specifying type="t" and by choosing the dimensions
you want to display in choices=.
plot(nmds_for, display = "species", type = "t", choices = c(2, 3))
Because of the limited time available for this subject we will focus on the practical aspects
of spatial analysis, i.e. things you might need if you add maps to your statistical project or
final thesis. This includes mainly import of vector and raster maps, plotting of maps and
statistical analyses.
First, we need to define the different types of spatial data
Point data, e.g. the location and one or more properties like the location of a tree and
its diameter. Normally this type is considered the simplest case of a vector file, but
we treat it separately, because mapping in ecology means frequently going out with
a GPS and writing down (or recording) the position and some properties (e.g. species
composition, occurrence of animals, diameter of trees...)
Vector data with different sub-types like a road or river map (normally coming
from a vector GIS like ArcGIS).
Grid or raster data are files with a regular grid, like digital images from a camera,
a digital elevation model (DEM) or the results of global models.
Point data is possibly the most frequent application for ecologists. Typically, positions are
recorded with a GPS device and then listed in Excel or even as plain text.
The procedure in R to convert point data to an internal or ESRI map is straightforward:
read in the data
define the columns containing the coordinates
convert everything to a point shapefile
Following is a brief R script that reads such records from a CSV file, converts them to the
appropriate R internal data format, and writes the location records as an ESRI Shape File.
The file Lakes.csv contains the following columns. 1: LAKE_ID, 2: LAKENAME, 3:
Longitude, 4: Latitude. For compatibility with ArcMap GIS, Longitude must appear
before Latitude.
library(sp)
library(maptools)
LakePoints = read.csv("Lakes.csv")
Columns 3 and 4 contain the geographical coordinates.
LakePointsSPDF =
SpatialPointsDataFrame(LakePoints[,3:4],data.frame(LakePoints[,1:4]))
plot(LakePointsSPDF)
Now write a shape-file for ESRI Software.
maptools:::write.pointShape(coordinates(LakePointsSPDF),data.frame(Lak
ePointsSPDF),"LakePointsShapeRev")
writeSpatialShape(LakePointsSPDF,"LakePointsShapeRev2")
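Note that the maptools package has since been retired; a sketch of the same export with the sf package, assuming the Lakes.csv layout described above (the WGS84 CRS code is an assumption):

```r
library(sf)
LakePoints <- read.csv("Lakes.csv")
# build point geometries from the Longitude/Latitude columns (WGS84 assumed)
LakesSF <- st_as_sf(LakePoints, coords = c("Longitude", "Latitude"), crs = 4326)
st_write(LakesSF, "LakePointsShapeSF.shp")
```
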
10.2.1 Bubble plots
A quite useful chart type is the spatial bubble plot, where the size of the bubble is proportional to
the value of the variable.
library(sp)
library(lattice)
## bubble plots for cadmium and zinc
data(meuse)
coordinates(meuse) <- c("x", "y") # promote to SpatialPointsDataFrame
bubble(meuse, "cadmium", maxsize = 1.5, main = "cadmium concentrations
(ppm)", key.entries = 2^(-1:4))
bubble(meuse, "zinc", maxsize = 1.5, main = "zinc concentrations
(ppm)", key.entries = 100 * 2^(0:4))
To show you how maps are used for statistics we want to find out the land use type on steep
slopes.
library(rgdal)    # provides readGDAL()
library(raster)   # provides raster(), spplot() works via sp
slopegrd = readGDAL("slope.asc")
slope = raster(slopegrd)
spplot(slope)
hist(slope)
Extract all cells with slope >4
steep = slope>4
Multiply with land use multiplication with 0 is 0, for 1 the value of land use is taken.
lu_steep = steep * lu87
Finally, count the different classes
freq(lu_steep)
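The masking trick above (multiplying a logical mask by a value grid) can be illustrated with plain matrices, a toy stand-in for the raster objects; the grid values and land use codes here are made up:

```r
# Toy "slope" grid and "land use" grid (1 = forest, 2 = grass, 3 = water)
slope <- matrix(c(1, 5, 6,
                  2, 3, 7,
                  8, 4, 2), nrow = 3, byrow = TRUE)
landuse <- matrix(c(1, 1, 2,
                    2, 3, 3,
                    1, 2, 3), nrow = 3, byrow = TRUE)

steep <- slope > 4            # logical mask: TRUE where slope > 4
lu_steep <- steep * landuse   # TRUE acts as 1, FALSE as 0

table(lu_steep)               # class counts, as freq() reports for rasters
```

Cells that are not steep end up as 0, so the class counts of the steep cells can be read directly from the table.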
37: Calculate the land use at elevations > 1000 m (code for forest is 1) and analyse the
increasing forest area between the two land use maps. Hints: use map algebra as above.
Unfortunately, R is not very well suited for vector data, therefore we suggest that you prepare
the vector files as far as possible with a real GIS. If you really want to take a close look at
vector maps in R, you can read the following help files and the book by Bivand et al. 2008.
Because R is not strong with vector maps, we convert everything to a raster. First, we define
the size and extent of the new raster map:
rasTemplate <- raster(ncol=110, nrow=110, crs=as.character(NA))
extent(rasTemplate) <- extent(vecLandcover)
The final conversion is
rasLandcover <- rasterize(vecLandcover, rasTemplate, field="GRIDCODE")
The field="GRIDCODE" part defines the variable which contains the code for the land use.
rasBuildings <- rasterize(vecBuildings, rasTemplate)
rasRoads <- rasterize(vecRoads, rasTemplate)
rasRivers <- rasterize(vecRivers, rasTemplate)
Finally, check the result with a plot:
plot(rasLandcover)
plot(rasBuildings)
plot(rasRoads)
plot(rasRivers)
A simple application of map operations is the creation of a buffer zone around streets or
buildings. This can be done with the boundaries function, which draws a line around the edges
of a raster:
ras2 <- boundaries(rasRoads, type="outer")
but you can check with
plot(ras2)
that only the edges are drawn. To add one map to the other we use
rasRoads2 <- cover(rasRoads, ras2)
You can also join the commands above into one:
ras2 = raster(rasBuildings, layer=2)
ras3 = boundaries(ras2, type="outer")
# wrong data type
rasBuildings <- cover(rasBuildings, ras3)
rasBuildings <- cover(rasBuildings, boundaries(rasBuildings, type="outer"))
rasRoads <- cover(rasRoads, boundaries(rasRoads, type="outer"))
The final step is to combine the buildings, roads, rivers, and landcover rasters into one. We
will cover the landcover raster with the other three.
Examining the rasBuildings plot, you will notice that the features are assigned a value of 1
and the background a value of 0. In order to cover one raster over another, we
need to set these 0 values to NA. On a raster, NA means that a cell is transparent. So let's do
this for all the covering rasters:
rasBuildings[rasBuildings==0] <- NA
rasRoads[rasRoads==0] <- NA
rasRivers[rasRivers==0] <- NA
The features on each of these three rasters have a value of 1. In order to differentiate these
features on the final raster, we need to give each feature a different value. Recall that our
landcover classes are 0 to 4. Let's set rivers to 5, buildings to 6, and roads to 7. It seems to
be standard practice to use a continuous set of integers when creating feature classes on
rasters.
rasRivers[rasRivers==1] <- 5
rasBuildings[rasBuildings==1] <- 6
rasRoads[rasRoads==1] <- 7
And now we can combine these using the cover function, with the raster on top first and
the raster on bottom last in the list:
patchmap <- cover(rasBuildings, rasRoads, rasRivers, rasLandcover)
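The semantics of cover() can be sketched with plain vectors, a toy stand-in for the rasters above (values made up, using the codes 0-4 for landcover, 5 rivers, 6 buildings, 7 roads):

```r
# cover() keeps the value of the top layer where it is not NA and
# falls back to the next layer; NA cells are "transparent".
landcover <- c(0, 1, 2, 3, 4, 2)
buildings <- c(6, NA, NA, 6, NA, NA)
roads     <- c(NA, 7, NA, NA, NA, 7)

cover_vec <- function(top, bottom) ifelse(is.na(top), bottom, top)

# buildings win over roads, roads over landcover
patch <- cover_vec(buildings, cover_vec(roads, landcover))
patch
```

Each cell of patch takes the first non-NA value from top to bottom, exactly the behaviour cover() gives for whole rasters.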
The attributes of a map are stored in so-called slots. You get a list with
slotNames(myLanduse)
The slot we are interested in is data
str(myLanduse@data)
where you find all attributes of the map. You can manipulate these variables as usual, e.g.
myLanduse@data[1,]
If you want to manipulate or select data you can
attach(myLanduse@data)
myLanduse@data[GRIDCODE==1,1]
You could also use the rgdal library
library(rgdal)
myLand2 <- readOGR(dsn="landuse.shp", layer="landuse")
11.1 Definitions
Time Series: In statistics and signal processing, a time series is a sequence of data points,
measured typically at successive times, spaced at (often uniform) time intervals. Time series
analysis comprises methods that attempt to understand such time series, often either to
understand the underlying theory of the data points (where did they come from? what
generated them?), or to make forecasts (predictions). Time series prediction is the use of a model
to predict future events based on known past events: to predict future data points before they
are measured. The standard example is the opening price of a share of stock based on its past
performance.
Trend: In statistics, a trend is a long-term movement in time series data after other
components have been accounted for.
Amplitude: The amplitude is a non-negative scalar measure of a wave's magnitude of
oscillation.
Frequency: Frequency is the measurement of the number of times that a repeated event
occurs per unit of time. It is also defined as the rate of change of phase of a sinusoidal
waveform. (Measured in Hz) Frequency has an inverse relationship to the concept of
wavelength.
Autocorrelation is a mathematical tool used frequently in signal processing for analysing
functions or series of values, such as time domain signals. Informally, it is a measure
of how well a signal matches a time-shifted version of itself, as a function of the
amount of time shift (the Lag). More precisely, it is the cross-correlation of a signal
with itself. Autocorrelation is useful for finding repeating patterns in a signal, such as
determining the presence of a periodic signal which has been buried under noise, or
identifying the missing fundamental frequency in a signal implied by its harmonic
frequencies.
Period: time period or cycle duration is the reciprocal value of frequency: T = 1/frequency
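The link between autocorrelation, lag, and period can be made concrete with a small synthetic example (signal and noise level made up): a sine with period 12 should match a copy of itself shifted by one full period, so its ACF peaks again at lag 12 and dips at the half period.

```r
# Autocorrelation of a noisy periodic signal with period 12
set.seed(42)
x <- sin(2 * pi * (1:240) / 12) + rnorm(240, sd = 0.3)

a <- acf(x, lag.max = 24, plot = FALSE)
a$acf[13]   # lag 12 (index 1 is lag 0): strongly positive
a$acf[7]    # lag 6, half a period: negative
```

The repeating pattern survives the added noise, which is exactly why autocorrelation is used to detect periodic signals buried under noise.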
Name Content
Date Date
Peff Effective precipitation (mm)
Evpo_Edry Evaporation from dry alder carr (mm)
T_air Air temperature (C)
Sunshine Sunshine duration (h)
Humid_rel Relative Humidity (%)
H_GW Groundwater level (m)
H_ERLdry Water level in dry part of alder carr (m)
H_ERLwet Water level in wet part of alder carr (m)
H_lake Water level in Lake Belau (m)
Infiltra Infiltration into the soil (mm)
11.3 Data
If the factors are defined, you can use the following function to create all kinds of
summaries (sums, means, ...).
aggregate(T_air, list(n = months), mean)
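As a self-contained illustration of the same call with made-up daily values and month labels:

```r
# Monthly means of a daily variable via aggregate()
temps  <- c(2, 4, 3, 10, 12, 11)                 # hypothetical daily temperatures
months <- c("Jan", "Jan", "Jan", "Jun", "Jun", "Jun")

aggregate(temps, list(n = months), mean)
#     n  x
# 1 Jan  3
# 2 Jun 11
```

The grouping variable ends up in column n and the group means in column x; any other summary function (sum, median, ...) can be used in place of mean.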
With the dplyr library, the calculation of mean values is quite straightforward:
t_annual = dplyr::group_by(t, years)
t_ann_mean = dplyr::summarise(t_annual,
    mean_t = mean(AirTemp_Mean),
    median_t = median(AirTemp_Mean),
    sum_prec = sum(Peff))    # annual precipitation sum for the second plot
qplot(as.numeric(years), mean_t, data=t_ann_mean, geom="line")
qplot(as.numeric(years), sum_prec, data=t_ann_mean, geom="line")
38: create boxplots of the lake and groundwater levels, grouped by month and year
39: calculate the monthly mean lake water levels
40: create a scatterplot of groundwater and lake water levels
11.4.1 Structure of Time Series
In statistics, TS are composed of the following subcomponents:
Y_t = T_t + S_i + R_t
where
T = Trend, a monotone function of time t
S = one or more seasonal component(s) (cycles) of length/duration i
R = Residuals, the unexplained rest
The analysis of TS is entirely based on this concept. The first step is usually to detect and
eliminate trends. In the following steps, the cyclic components are analysed. Sometimes,
the known seasonal influence is also removed.
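The additive model Y_t = T_t + S_i + R_t can be made tangible with a synthetic series (trend, cycle, and noise level all made up here): we build a series from a known linear trend, a yearly cycle, and noise, and let stl() take it apart again.

```r
# Construct a monthly series from known components and decompose it
set.seed(1)
trend  <- seq(10, 14, length.out = 120)            # 10 years, monthly values
season <- rep(sin(2 * pi * (1:12) / 12), 10)       # yearly cycle
y      <- ts(trend + season + rnorm(120, sd = 0.2), frequency = 12)

dec <- stl(y, s.window = "periodic")
head(dec$time.series)    # columns: seasonal, trend, remainder
```

The recovered trend column rises almost linearly, matching the ramp we put in, and the remainder column is what is left after trend and season are removed.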
11.4.2 Trend Analysis
Normally, trend analysis is a linear or non-linear regression analysis with time as the x-axis or
independent variable. Many authors also use the term for the various filtering algorithms
that are normally used to make plots of data look smoother.
41: use a linear model to remove the trend from the air temperature
(Hint: function lm, look at the contents of the results)
11.4.2.2 Filter
Some TS show a high degree of variation, and the real information may be hidden in this
variation. This is why there are several methods of filtering or smoothing a
data set. Sometimes this process is also called low-pass filtering, because such a filter removes the
high pitches from a sound file and lets the low frequencies pass. The most frequently used
methods are splines and moving averages. Moving averages are computed as mean values of
a number of records before and after the actual value. The averaging range determines
the smoothness of the curve. If filtering is used to remove trends, "detrended" means the
deviations from the moving averages.
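A centred moving average can be computed with the base function stats::filter() (not to be confused with dplyr::filter); the data values here are made up:

```r
# 5-point centred moving average as a simple low-pass filter:
# each value becomes the mean of itself and two neighbours on each side.
x <- c(1, 3, 2, 8, 7, 9, 4, 6, 5, 10)
smooth5 <- stats::filter(x, rep(1/5, 5), sides = 2)

smooth5                     # NA at both ends, where the window is incomplete
detrended <- x - smooth5    # deviations from the moving average
```

Widening the window (e.g. rep(1/11, 11)) gives a smoother curve, at the cost of more NA values at the ends.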
11.7 TS in R
First, we have to define the data set as a time series.
attach(t)
lake = ts(H_lake, start=c(1989,1),freq=365)
Next, we can plot an overview of the analysis:
ts = stl(lake,s.window="periodic")
plot(ts)
or look at the text summary:
summary(ts)
A look at the structure of the results
str(ts)
reveals that you can extract the detrended and deseasonalized remainder with
clean_ts = ts$time.series[,3]
for further analysis. Please take a look at the help-page of the procedure to understand
what happens below the surface.
For time series analysis we often need so-called lag variables, i.e. the data set moved back
or forth by a number of time steps. A typical example is the unit hydrograph, which relates
the actual discharge to the effective precipitation of a number of past days. This number is
called the lag. You can create the corresponding time series with the lag function:
ts_test = as.ts(t$H_GW) # Groundwater
lagtest <- ts_test # temp var
for (i in 1:4) {lagtest <- cbind(lagtest,lag(ts_test,-i))}
Now check the structure and the content of lagtest.
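On a tiny made-up series, this is what the lag() and cbind() construction produces: each additional column is the series shifted back by one more time step, and cbind() aligns the columns by time, padding with NA.

```r
# Lagged copies of a short series, aligned by time
ts_small <- ts(1:5)
m <- cbind(ts_small, lag(ts_small, -1), lag(ts_small, -2))
m
# rows cover times 1..7; column 2 starts one step later,
# column 3 two steps later, with NA where no value exists
```

Row t therefore holds the value at time t next to the values at t-1 and t-2, which is exactly the layout needed for regressions on past values.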
43: analyse the groundwater level (detrend, remove seasonal trends)
44: analyse the influence of the water level in the wet part of the alder carr (H_ERLwet) on
the lake level (H_lake) and the groundwater level (H_GW)
45: analyse the autocorrelation of different nutrients from the
wqual.data (see page 117 for a description)
Next, we can use a different approach with a different scaling. The base period is now 365
days, i.e. a frequency of 1 means once per year.
air = read.csv("http://www.hydrology.uni-kiel.de/~schorsch/air_temp.csv")
airtemp = ts(air$T_air, start=c(1989,1), freq = 365)
spec.pgram(airtemp,xlim=c(0,10))
To compute the residuals, we use the information from spectral analysis to create a linear
model.
x <- (1:3652)/365
summary(lm(air$T_air ~ sin(2*pi*x) + cos(2*pi*x) + sin(4*pi*x) + cos(4*pi*x) + sin(6*pi*x) + cos(6*pi*x) + x))
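The harmonic-regression idea can be checked on synthetic data (mean, amplitude, and noise level made up): we generate a temperature-like series with a known annual sine cycle and see whether lm() recovers the parameters.

```r
# Fit sin/cos terms to a series with a known annual cycle
set.seed(7)
x <- (1:730) / 365                           # two years, daily resolution
y <- 8 + 5 * sin(2 * pi * x) + rnorm(730, sd = 1)

fit <- lm(y ~ sin(2 * pi * x) + cos(2 * pi * x))
coef(fit)   # intercept near 8, sin coefficient near 5, cos near 0
```

With real data, the residuals of such a fit are the deseasonalized series, which is what the model above computes for the air temperature.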
46: analyse the periodogram of the lake water level before and after
the stl analysis
Hints
prepare the figures step by step
use aggregate to calculate the annual and monthly summaries
Cloud_Cover
RelHum
Mean_Temp
Airpressure
Min_Temp_5cm
Min_Temp
Max_Temp
prec
sunshine
snowdepth
48: Analyse the slope of the different variables. Is there a significant increase?
14 Solutions
Solution 2:
Climate$Year_fac = as.factor(Climate$Year)
Climate$Month_fac = as.factor(Climate$Month)
First Version:
Climate$Summer = 0
Climate$Summer[Climate$Month>5 & Climate$Month<10]=1
Second Version:
Climate$Summer = (Climate$Month>5) & (Climate$Month<10)
The result is a Boolean variable.
Solution 10:
plot(Mean_Temp ~ Date, type="l")
lines(Max_Temp ~ Date, type="l", col="red")
lines(Min_Temp ~ Date, type = "l", col="blue")
Solution 12:
m2 = (Max_Temp+Min_Temp)/2
scatterplot(Mean_Temp ~ m2)
scatterplot(Mean_Temp ~ m2| Year_fac)
Solution 37:
ue1000 = dem > 1000
t2 = ue1000 * dem
spplot(t2)
ue1000b = (dem > 1000) * dem
forest87=lu87==1
forest07=lu07==1
ue1000 =dem>1000
forest87a=forest87*ue1000
forest07a=forest07*ue1000
# increase: 87=1, 07=0
diff87_07 = (forest87a ==1) & (forest07a == 0)
spplot(diff87_07)
summary(diff87_07)
Cells: 770875
NAs : 378939
Mode "logical"
FALSE "384320"
TRUE "7616" Decrease
NA's "378939"
# increase 87=0, 07 = 1
diff07_87 = (forest87a ==0) & (forest07a == 1)
spplot(diff07_87)
summary(diff07_87)
Cells: 770875
NAs : 378943
Mode "logical"
FALSE "370912"
TRUE "21020" increase
NA's "378943"
# any spatial patterns?
diff= diff87_07-diff07_87
spplot(diff)
Solution 38:
boxplot (H_lake ~ months)
boxplot (H_GW ~ months)
boxplot (H_lake ~ years)
boxplot (H_GW ~ years)
Solution 39:
aggregate(H_lake, list(n = months), mean)
Solution 43:
gw = ts(H_GW, start=c(1989,1),freq=365)
plot(stl(gw,s.window="periodic"))
Solution 44:
ccf(H_ERLwet, H_lake, lag.max=365, plot=TRUE)
ccf(H_ERLwet, H_GW, lag.max=365, plot=TRUE)
Solution 46: