R (and QGIS)
Nov 28, 2015
The goal of this post is to demonstrate the ability of R to classify multispectral imagery using
RandomForests algorithms. RandomForests are currently one of the top performing algorithms
for data classification and regression. Although their interpretability may be difficult,
RandomForests are widely popular because of their ability to classify large amounts of data with
high accuracy.
In the sections below I show how to import a Landsat image into R and how to extract pixel data
to train and fit a RandomForests model. I also explain how to speed up image classification
through parallel processing. Finally, I demonstrate how to implement this R-based
RandomForests algorithm for image classification in QGIS.
For the purpose of this post, I'm going to conduct a land-cover classification of a 6-band Landsat
7 image (path 7, row 57) taken in 2000 that has been processed to surface reflectance, as shown
in a previous post in my blog. Several R packages are needed, including rgdal, raster, caret,
randomForest and e1071. After installation, let's load the packages:
library(rgdal)
library(raster)
library(caret)
library(randomForest)
library(e1071)
Now let's import the Landsat image into R as a RasterBrick object using the brick function
from the raster package. Also let's replace the original band names (e.g., X485.0.Nanometers)
with shorter ones (B1 to B5, and B7):
img <- brick("C:/data/landsat/images/2000/LE70070572000076EDC00/L7007057_20000316_refl")
names(img) <- c(paste0("B", 1:5), "B7")
We can make an RGB visualization of the Landsat image in R using the plotRGB command, for
example a false color composite RGB 4:5:3 (near infrared - shortwave infrared - red). I'm
using the expression img * (img >= 0) to convert the negative values to zero:
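A minimal call, assuming img is the RasterBrick created above (the stretch argument is my addition for display purposes):

```r
# False color composite RGB 4:5:3; negative values clipped to zero
plotRGB(img * (img >= 0), r = 4, g = 5, b = 3, stretch = "lin")
```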
Now let's extract the pixel values in the training areas for every band in the Landsat image and
store them in a data frame (called here dfAll) along with the corresponding land cover class id:
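The code for this step did not survive in this copy of the post; the sketch below reconstructs it from the fragments quoted in the comments. Here trainData and responseCol are assumed names for the training polygons object (a SpatialPolygonsDataFrame) and its class-id column:

```r
# Build an empty data frame with one column per band plus a class column
dfAll <- data.frame(matrix(vector(), nrow = 0, ncol = length(names(img)) + 1))

for (i in 1:length(unique(trainData[[responseCol]]))) {
  category <- unique(trainData[[responseCol]])[i]
  # Polygons belonging to the current land cover class
  categorymap <- trainData[trainData[[responseCol]] == category, ]
  # Extract the pixel values of every band inside those polygons
  dataSet <- extract(img, categorymap)
  # Append the class id to each extracted pixel and stack the results
  dataSet <- lapply(dataSet, function(x) cbind(x, class = as.numeric(rep(category, nrow(x)))))
  dfAll <- rbind(dfAll, do.call("rbind", dataSet))
}
```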
The resulting data frame for my data has about 80,000 rows. It is necessary to work
with a smaller dataset, as it may take a long time to train and fit a RandomForests model with a
dataset of this size. For a start, let's subset the data by generating 1,000 random samples:
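A minimal version, consistent with the line quoted later in the comments:

```r
# Keep 1000 randomly chosen rows from the full training data frame
nsamples <- 1000
sdfAll <- dfAll[sample(1:nrow(dfAll), nsamples), ]
```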
Next we must define and fit the RandomForests model using the train function from the caret
package. First, let's specify the model as a formula with the dependent variable (i.e., the land
cover type ids) encoded as factors. For this exercise I'll only use three bands as explanatory
variables (the red, near infrared and shortwave infrared bands). We then define the method as rf,
which stands for the random forest algorithm. (Note: try names(getModelInfo()) to see a
complete list of all the classification and regression methods available in the caret package.)
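Assembled from the fragments quoted later in the thread, the fit might look like this (sdfAll is the subset created above):

```r
# Fit a RandomForests model with bands 3-5 (red, NIR, SWIR) as predictors
modFit_rf <- train(as.factor(class) ~ B3 + B4 + B5, method = "rf", data = sdfAll)
```

The beginCluster()/clusterR() lines below then apply this fitted model to every pixel of the image in parallel.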
beginCluster()
preds_rf <- clusterR(img, raster::predict, args = list(model = modFit_rf))
endCluster()
To run the QGIS version of the R script described above, you can download the script
available at the following link and save it in the R Scripts folder (or copy and paste the content
into the QGIS script editor), as explained in my previous post:
Watch the following video to see how to perform a RandomForests classification for a Landsat
image in QGIS using R packages:
Additional resources
To dig deeper into the process of creating predictive models, I recommend visiting the caret
package website, which provides extensive documentation about data preprocessing, data
splitting, variable importance evaluation, and model fitting and tuning. Also take a look at
RStoolbox, a new R package that provides a set of tools for remote sensing processing.
The R+QGIS approach shown in this post expands the image classification methods available in
QGIS. There are other image processing techniques included in QGIS such as those found in the
Semi-Automatic Classification Plugin, the GRASS GIS plugin and the Orfeo Toolbox. I suggest
you also explore these other options.
In a future post I'll write about recommended practices for accuracy assessment of classified
images through the comparison of reference data versus the corresponding classification results.
Stay tuned!
Hi Ali,
I solved the problem of overall accuracy. Here is what I wrote:
tm <- tm_shape(preds_rf) + tm_raster(alpha = 0, n = 9, style = "pretty",
  interval.closure = "left", labels = c("Culture", "Batis", "Savane_Arbustive",
  "Savane_Herbeuse", "Sol_Argilo_Sableux", "Sol_Sablo_Argileux", "Zone_Humide",
  "Eau", "Sable"), auto.palette.mapping = TRUE, max.categories = 9,
  saturation = 1, interpolate = FALSE, title = "Land Cover of Iro region")
tm
I get this error message:
Error: cannot allocate vector of size 676.1 Mb
How can I resolve this problem?
Sincerely,
Hello. You may use memory.limit to increase the limit on memory allocation. Also please
read these threads on Stack Overflow and other forums: http://stackoverflow.com/quest...,
https://stat.ethz.ch/pipermail..., https://www.r-bloggers.com/mem...
Hi Ali,
I would like to do an SVM classification on a Landsat ETM+ image. Is it possible to do SVM
classification using the R language?
Hi Manikandan. Yes, it's possible. You can use the svmRadial method from the kernlab
package through caret, for instance:
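The example itself did not survive in this copy of the thread; a sketch of what such a call could look like, reusing the sdfAll training data frame from the post (svmRadial requires the kernlab package to be installed):

```r
# SVM with a radial basis function kernel, via caret's unified train interface
modFit_svm <- train(as.factor(class) ~ B3 + B4 + B5, method = "svmRadial", data = sdfAll)
```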
Hi Ali, thanks for providing the code for random forest. It has actually helped me a lot in
learning R, as I am at the learning stage of R for classification of TM/ETM+ imagery. I'm getting
an error when I run this code:
sdfAll <- subset(dfAll[sample(1:nrow(dfAll), nsamples), ])
Error in sample.int(length(x), size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'
Hi Manikandan. To help you, please upload your script and a sample of your data to an
online repository and send me the download link through my contact page:
http://amsantac.co/contact.htm...
Hi Ali,
There are many samples with all zero values though - could this be the problem? If so, can I just
remove the zero values in a similar way to the NAs? Or perhaps I should remove them all prior
to doing anything else...
Thanks Lucy for your comment. I'm glad you find my posts useful,
I'd suggest you examine the class column in your data. The warning says 'Can't have
empty classes in y', so there may be some rows lacking a class label.
Are those zero values present in more than one class? If so, they may not be providing
any useful information to the classifier and could be removed. If the zero values are
present in just one class, for example water, then it might be better to keep them. Also
please examine whether zero is the value assigned to NoData pixels. Sometimes NoData
pixels are coded as -9999, for instance. It may be the case that NoData pixels in your
image are coded as zero.
Thanks Ali.
I have removed all the zeros from the whole dataset - they took up more than one class as
far as I could see.
The train function now seems to be running...but it is taking a VERY long time. It has
already been running for 24 hrs... My dataset is huge, but this is just the train part so
perhaps something has gone wrong...!
Lucy, if your dataset is huge, then yes, it may take a very long time. I'd suggest you first try
a smaller dataset (e.g., about 10,000 observations) and time how long the model training takes,
so you have an idea of how long it would take for the whole dataset.
I'd also suggest you use the randomForest package directly, instead of the caret package.
That is:
library(randomForest)
model1 <- randomForest(as.factor(class) ~ R + G + B, data = sdfAll)
Usually this is faster than the train function from caret. Finally, it may depend on the type of
data and the number of classes, but a model trained with millions of observations may offer
only slightly higher accuracy than a model trained with a smaller dataset (i.e., thousands
of records).
Thank you for this comprehensive tutorial. I am trying to run the random forest classifier on a
Landsat 8 image. I have gotten to the part where you extract the training pixel values, and I am
getting the following error:
My code is below:
##Load libraries
library(rgdal)
library(raster)
library(caret)
library(randomForest)
library(e1071)
if (is(trainData, "SpatialPointsDataFrame")) {
  dataSet <- cbind(dataSet, class = as.numeric(category))
  dfAll <- rbind(dfAll, dataSet)
}
if (is(trainData, "SpatialPolygonsDataFrame")) {
  dataSet <- lapply(dataSet, function(x) {cbind(x, class = as.numeric(rep(category, nrow(x))))})
  df <- do.call("rbind", dataSet)
  dfAll <- rbind(dfAll, df)
}
}
------
If you have any suggestions on how I can fix this, I would be really grateful. Thank you!
Does your training data perhaps come from a file with elevation data (Z coordinates)
included? If so, the Z data have to be dropped when importing the file into R. You can
use the pointDropZ parameter in readOGR for that purpose:
library(rgdal)
trainData <- readOGR("path_to_your_file", layer = "name_of_your_file", pointDropZ = TRUE)
Also please see my answer to a comment below (from Johannes May) about the modified
code version for working with points instead of polygons, if that applies to your data.
Hello,
I successfully applied your tutorial to image classification with randomForest. The result
obtained is as follows:
I have some questions:
1) How can I obtain the grids and a legend with the name of each type of land cover, like this:
Using a Sentinel-2 image and 3 LC classes, I get "Error in rep(category, nrow(x)) : invalid 'times'
argument" for the rep() function. Are you able to suggest how to fix the times argument? This
seems to exceed my personal R expertise. NoData is not an issue in my data. Apparently such an
error happens if times becomes negative, though I'm not sure how this could happen. I hope you
have a suggestion?
The issue is that you are using a point shapefile instead of a polygon shapefile, which
the algorithm was designed for.
The following code solves the issue. It generates the data.frame with the extracted training
data using a point shapefile:
if (is(trainData, "SpatialPointsDataFrame")) {
  dataSet <- cbind(dataSet, class = as.numeric(category))
  dfAll <- rbind(dfAll, dataSet)
}
if (is(trainData, "SpatialPolygonsDataFrame")) {
  dataSet <- lapply(dataSet, function(x) {cbind(x, class = as.numeric(rep(category, nrow(x))))})
  df <- do.call("rbind", dataSet)
  dfAll <- rbind(dfAll, df)
}
}
I'll be updating the code in the post with this improvement soon.
you are using the lapply command (not sapply). Second, as you only have 3 classes, I
suggest you iterate the command lines inside the loop by yourself. I mean, set i = 1, then
run the six lines inside the loop. Then set i = 2 and repeat, and then do the same for i = 3. That
way it may be easier for you to identify at what step the error shows up. If you don't identify the
error, I'd suggest you send me a message through my contact page
(http://amsantac.co/contact.htm... with links for downloading a sample of your data so I can
check them out. I'd be glad to help you solve this issue.
Interesting! Can you post or point to more satellite image classification with different machine
learning algorithms, like boosting, neural networks, convolutional neural networks, ensemble
learning, SVM, ridge regression, backpropagation, etc.?
Yes, that's my plan! As soon as I have some free time (hopefully in a couple of weeks) I'll
be posting more on this topic. Stay tuned!
Thanks. Could you elaborate on how I can save the classified image, the output of your
random forest algorithm, as a georeferenced image so that I can compare this classified
image with the raw unclassified image in QGIS? Thanks.
Ali Mod Shariful Islam 4 months ago
Sure. Following my example, one can export the classified image with the writeRaster
command:
writeRaster(preds_rf, "exported.tif")
Thank you. Could you say why I am having trouble with training, as in modFit_rf <-
train(as.factor(class) ~ B3 + B4 + B5, method = "J48", data = sdfAll), where I am using the
C4.5-like Trees model as described in http://topepo.github.io/caret/...
Even though I installed the 'RWeka' package separately, it was all in vain.
Thanks in advance.
Thanks for the great example. I am actually getting warnings whose cause I don't know. I
got 35 warning messages saying "model fit failed for Resample... Can't have empty classes in
y", although I am sure that all the polygons have a class value (as a string and also as a number),
the same as in your example.
Hi Ahmed. A possible reason is that there are pixels with No Data values (NAs) in one or
more of the bands of your raster files. If you are using my script, you may test presence
of NAs with: any(is.na(dfAll)). To remove data.frame rows with NAs, use: na.omit(dfAll)
Interesting post. I have hyperspectral images containing 102 bands. Will the above
methods work? Any suggestions are welcome. Thank you.
o
Hi Nandhini. Thanks for your comment. Yes, these methods should work if you are
interested in conducting a supervised classification. Be aware, however, that the number
of bands is quite large, so trying RandomForests in R with your hyperspectral image can be
computationally intensive. Best regards.
I am following this tutorial; however, I am having difficulty loading a Landsat 8 image with 12
bands. The directory has been changed to point to the image path, but there seems to be a
problem using the brick function. Each band [1:12] is a .tif file. I appreciate your assistance.
Hi. Please verify that all the bands that you want to stack (or brick) have the same extent.
For example, the panchromatic band (B8) may have a different extent. I'd also suggest
you use the stack command (before using brick). Hope this helps to solve the problem
you found
Nice post. I have fMRI images. I need to extract the features and prepare a dataset out of
them; after that, I have to use random forests for classification.
Can you please help me with how to do that? At the same time, I also need to reduce the number
of features, as they are huge in number.
Thanks for your comment. That's an interesting question. What is the format of your
fMRI images? If they can be read into R as a multilayer image format, then the remaining
processing would be quite similar to what I show in this post. I'd be glad to talk to you
about that off this thread, so please send me a message through my contact page:
http://amsantac.co/contact.htm...
I just tried the viewRGB function. It's awesome! The mapview package has very
interesting and useful features. Thanks a lot for sharing, Tim!
Great post! Thanks for the detailed explanation and tutorial. Definitely a very useful tool that I
will have to use.
Thank you very much Diego for your comment. Hope this post serves for your research.
I'll be posting new content in my blog soon. Saludos!
Nice post, very useful. Would you like to try it with LiDAR data?
If so, I can share a small piece.
Best regards,
Rafaela
Hi Maria, if you want to classify DEMs, DSMs or derivatives from LiDAR, check out
this paper: http://www.mdpi.com/2072-4292/... (shameless self-promotion). There is also
a script for running random forest in R and a small LiDAR dataset to test it out.
cheers
Rafaela
Thanks for the paper... The script amsantac wrote is working well; I have not tried it with LiDAR
yet... Actually, I have not seen LiDAR used with Random Forest. I tried to do some things, but I
never succeeded. Can you show us?
Cheers,
Rafaela
My email: rafasalum@hotmail.com
Hi Maria,
in the supplemental information of that paper (scroll down almost to the bottom) you can
download a small piece of lidar data and some training data and try with the script
provided.
For LiDAR data classification, I would recommend using specialized algorithms such as those
provided by LAStools (http://rapidlasso.com/lastools... or MARS software
(http://www.merrick.com/Geospat....
Thanks for this post. Can you please post the datasets you used? I would like to test the code on
your datasets and explore further.
Thanks for your comment. I considered posting the datasets but discarded the idea due to
the large file size of the Landsat images. You may find recent images for your area of
interest using EarthExplorer, which has a user-friendly interface for browsing and
downloading images.
Hello,
I successfully applied your tutorial to image classification with randomForest. The result
obtained is as follows:
Hello,
1) There are several options to create the grids and legend. You can use the spplot,
rasterVis or tmap packages. For spplot examples see this link: https://edzer.github.io/sp/.
For tmap see: https://cloud.r-project.org/we.... For example, the starting code for tmap
would be:
library(tmap)
tm_shape(preds_rf) + tm_raster() + tm_grid() # customize it by changing the default parameters
2) You can get the accuracy and kappa metrics (based on the training dataset) just by
printing your model (e.g., modFit_rf). For calculating accuracy metrics by cross-tabulating
observed and predicted classes, you can use the confusionMatrix command from the caret
package.
3) If you look at the elapsed time, it tells you that the previous command line took 1203
seconds (about 20 minutes) to process.
4) In neuralnet, all weights are initialized by default with random values drawn from a
standard normal distribution. If you set startweights = NULL, the weights will be
randomly initialized. If you have computed your own weights, you can enter them as
a vector for the startweights parameter.
Hi,
Thank you for your explanation.
I would like to have more information about the data and reference arguments in
confusionMatrix. When I type:
L8_Iro_2016 <- stack("L8_Iro_2016.tif")
names(L8_Iro_2016) <- c(paste0("B", 1:3))
writeRaster(L8_Iro_2016, filename = "merge_Iro_2016.tif", overwrite = TRUE)
plotRGB(L8_Iro_2016 * (L8_Iro_2016 >= 0), r = 1, g = 2, b = 3, stretch = "hist", scale = 10000)
trainData <- shapefile("F://Sminaire_R/Classification_randomForest/training9.shp")
reponseCol1 <- "CLASS_ID"
trainData <- shapefile("F://Sminaire_R/Classification_randomForest/trainTrue9.shp")
reponseCol2 <- "CLASS_ID"
## Extracting training pixel values for Iro
dfAll = data.frame(matrix(vector(), nrow = 0, ncol = length(names(L8_Iro_2016)) + 1))
for (i in 1:length(unique(trainData[[reponseCol1]]))) {
  category <- unique(trainData[[reponseCol1]])[i]
  categorymap <- trainData[trainData[[reponseCol1]] == category, ]
  dataSet <- extract(L8_Iro_2016, categorymap)
  dataSet <- lapply(dataSet, function(x) {cbind(x, class = as.numeric(rep(category, nrow(x))))})
  df <- do.call("rbind", dataSet)
  dfAll <- rbind(dfAll, df)
}
## Model fitting and image classification for Iro using dfAll and "rf"
modFit_rf <- train(as.factor(class) ~ B1 + B2 + B3, method = "rf", data = dfAll)
beginCluster()
# preds_rf <- clusterR(L8_Iro_2016, raster::predict, args = list(model = modFit_rf))
system.time(preds_rf <- clusterR(L8_Iro_2016, raster::predict, args = list(model = modFit_rf)))
endCluster()
plot(preds_rf)
print(modFit_rf)
data <- c("reponseCol1", "reponseCol2")
Hi. First, you have to extract the band pixel values for the trainTrue9.shp polygons:
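The code block did not survive in this copy of the thread; the sketch below reconstructs it from the loop in the question above, loading the validation polygons into a separate object (valData is a hypothetical name) so that trainData is not overwritten:

```r
# Load the validation polygons separately instead of overwriting trainData
valData <- shapefile("F://Sminaire_R/Classification_randomForest/trainTrue9.shp")
reponseCol2 <- "CLASS_ID"

# Extract the band pixel values for the validation polygons
dfVal <- data.frame(matrix(vector(), nrow = 0, ncol = length(names(L8_Iro_2016)) + 1))
for (i in 1:length(unique(valData[[reponseCol2]]))) {
  category <- unique(valData[[reponseCol2]])[i]
  categorymap <- valData[valData[[reponseCol2]] == category, ]
  dataSet <- extract(L8_Iro_2016, categorymap)
  dataSet <- lapply(dataSet, function(x) cbind(x, class = as.numeric(rep(category, nrow(x)))))
  dfVal <- rbind(dfVal, do.call("rbind", dataSet))
}

# Then cross-tabulate observed vs. predicted classes, e.g.:
# confusionMatrix(predict(modFit_rf, dfVal), as.factor(dfVal$class))
```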
(I've not tested the code above, so it may have bugs.) Please note that you should use an
independent dataset for validation. I recommend you read these other posts, which can
provide help on validation: http://amsantac.co/blog/en/201...,
http://amsantac.co/blog/en/201....
Please also note that you are overwriting the trainData object in your code.