
Data Mining

Set-1

1. Convert .arff file into .csv file using Java-WEKA API


Procedure:

1) Open Eclipse.
2) File -> New -> Java Project and name it.
3) In that project, create a new folder and name it lib.
4) Copy weka.jar and paste it into the lib folder.
5) Refresh Eclipse by pressing F5, then right-click weka.jar in the lib folder -> Build Path -> Add to Build Path.
6) Create a new class in the project and write the code below.
7) Run it.
(PS: give the file paths correctly.)

ArffToCsv.java
import weka.core.Instances;
import weka.core.converters.ArffLoader;
import weka.core.converters.CSVSaver;
import java.io.File;
public class ArffToCsv
{
public static void main(String[] args)throws Exception
{
ArffLoader loader = new ArffLoader();
// source .arff file (adjust the path to your machine)
loader.setSource(new File("C:\\Users\\vinay\\eclipse-workspace\\CsvToArff\\Desktop\\Test_arff.arff"));
Instances data = loader.getDataSet();
CSVSaver saver = new CSVSaver();
saver.setInstances(data);
// setFile determines the output file; a separate setDestination call is not needed
saver.setFile(new File("airline.csv"));
saver.writeBatch();
System.out.println("Success\n");
}
}
output:
Success

2. Linear Regression using R-Tool


(https://www.tutorialspoint.com/r/r_linear_regression.htm)

x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation<-lm(y~x)
print(relation)
print(summary(relation))
a<-data.frame(x=170)
result<-predict(relation,a)
print(result)
png(file = "linearregression.png")
#Plot the chart.
plot(y, x, col = "blue", main = "Regression", abline(lm(x~y)),
     cex = 1.3, pch = 16, xlab = "Weight in Kg", ylab = "Height in cm")
dev.off()

output: the regression summary and the prediction for x = 170 are printed to the console; the chart is saved as linearregression.png.

Set-2
1. Convert .csv file into .arff file using Java-WEKA API
Procedure:

1) Open Eclipse.
2) File -> New -> Java Project and name it.
3) In that project, create a new folder and name it lib.
4) Copy weka.jar and paste it into the lib folder.
5) Refresh Eclipse by pressing F5, then right-click weka.jar in the lib folder -> Build Path -> Add to Build Path.
6) Create a new class in the project and write the code below.
7) Run it. (PS: give the file paths correctly.)

CsvToArff.java
import weka.core.Instances;
import weka.core.converters.CSVLoader;
import weka.core.converters.ArffSaver;
import java.io.File;
public class CsvToArff
{
public static void main(String[] args)throws Exception
{
CSVLoader loader = new CSVLoader();
// source .csv file (adjust the path to your machine)
loader.setSource(new File("C:\\Users\\vinay\\OneDrive\\Documents\\R\\win-library\\3.5\\Hmisc\\tests\\csv\\TEST.csv"));
Instances data = loader.getDataSet();
ArffSaver saver = new ArffSaver();
saver.setInstances(data);
// setFile determines the output file; a separate setDestination call is not needed
saver.setFile(new File("Test_arff.arff"));
saver.writeBatch();
System.out.println("Success\n");
}
}
output:
Success

2. K-NN in WEKA
Procedure:
1) Open WEKA and load any categorical .arff file.
2) Go to the Classify tab.
3) In the top-left corner click Choose.
4) Click lazy and select IBk.
5) Click Start.
NAME
weka.classifiers.lazy.IBk
SYNOPSIS
K-nearest neighbours classifier. Can select an appropriate value of K based on cross-validation. Can also do distance weighting.
OPTIONS
numDecimalPlaces -- The number of decimal places to be used for the output of
numbers in the model.
batchSize -- The preferred number of instances to process if batch prediction is
being performed. More or fewer instances may be provided, but this gives
implementations a chance to specify a preferred batch size.
KNN -- The number of neighbours to use.
distanceWeighting -- Gets the distance weighting method used.
nearestNeighbourSearchAlgorithm -- The nearest neighbour search algorithm to
use (Default: weka.core.neighboursearch.LinearNNSearch).
debug -- If set to true, classifier may output additional info to the console.
windowSize -- Gets the maximum number of instances allowed in the training
pool. The addition of new instances above this value will result in old instances
being removed. A value of 0 signifies no limit to the number of training instances.
doNotCheckCapabilities -- If set, classifier capabilities are not checked before
classifier is built (Use with caution to reduce runtime).
meanSquared -- Whether the mean squared error is used rather than mean
absolute error when doing cross-validation for regression problems.
crossValidate -- Whether hold-one-out cross-validation will be used to select the
best k value between 1 and the value specified as the KNN parameter.

Set-3

1. Convert .arff file into .csv file using WEKA


Procedure:
1) Open the WEKA GUI Chooser.
2) Click on Tools.
3) Click on ArffViewer.
4) Click File and open a .arff file.
5) Click File and save it as a .csv file.

2. Demonstrate the classifier J48 in WEKA and visualise the Decision Tree in WEKA
Procedure:
1) Open WEKA.
2) Open a .arff data file.
3) Click on the Classify tab.
4) Click Choose -> trees -> J48.
5) Click Start.
6) When the run finishes, right-click its entry in the result list on the left and select Visualize tree.

NAME
weka.classifiers.trees.J48
SYNOPSIS
Class for generating a pruned or unpruned C4.5 decision tree.
Set-4
1. Perform K-NN classification of Time series data using R-Tool
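No listing is given for this task; the sketch below is a minimal 1-NN classifier under DTW distance, assuming the dtw package and the synthetic_control.txt data used in the Timeseries.R listing at the end of this manual (file name and class layout are taken from that listing):

library(dtw) # loads the proxy package, which registers the "DTW" distance

sc <- read.table("synthetic_control.txt", header = FALSE)
classId <- rep(1:6, each = 100) # 6 classes, 100 series each

# classify a noisy copy of series 501 by its single nearest neighbour under DTW
newTS <- sc[501, ] + runif(100) * 15
distances <- dist(newTS, sc, method = "DTW")
nearest <- which.min(as.vector(distances))
classId[nearest] # predicted class label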

2. Perform the following preprocessing operations in WEKA


a. Attribute selection
NAME
weka.filters.supervised.attribute.AttributeSelection
SYNOPSIS
A supervised attribute filter that can be used to select attributes. It is
very flexible and allows various search and evaluation methods to be
combined.
b. Handling missing values
NAME
weka.filters.unsupervised.attribute.ReplaceMissingValues
SYNOPSIS
Replaces all missing values for nominal and numeric attributes in a
dataset with the modes and means from the training data. The class
attribute is skipped by default.

c. Discretisation
NAME
weka.filters.unsupervised.attribute.Discretize
SYNOPSIS
An instance filter that discretizes a range of numeric attributes in the
dataset into nominal attributes. Discretization is by simple binning.
Skips the class attribute if set.
d. Converting nominal attributes to binary attributes
NAME
weka.filters.unsupervised.attribute.NominalToBinary
SYNOPSIS
Converts all nominal attributes into binary numeric attributes. An
attribute with k values is transformed into k binary attributes if the
class is nominal (using the one-attribute-per-value approach). Binary
attributes are left binary if option '-A' is not given. If the class is
numeric, you might want to use the supervised version of this filter.
e. Standardisation
NAME
weka.filters.unsupervised.attribute.Standardize
SYNOPSIS
Standardizes all numeric attributes in the given dataset to have zero
mean and unit variance (apart from the class attribute, if set).

Set-5
1. Random Forest in R
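No listing is given; the sketch below is a minimal example, assuming the randomForest package and the built-in iris data:

# install.packages("randomForest")
library(randomForest)

data(iris)
set.seed(42)
ind <- sample(nrow(iris), floor(nrow(iris) * 0.7)) # 70/30 train/test split
train <- iris[ind, ]
test <- iris[-ind, ]

rf <- randomForest(Species ~ ., data = train, ntree = 500) # forest of 500 trees
print(rf) # OOB error estimate and confusion matrix
pred <- predict(rf, newdata = test)
table(pred, test$Species) # confusion matrix on the held-out data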
2. Perform the following preprocessing operations in WEKA
a. Attribute selection
NAME
weka.filters.supervised.attribute.AttributeSelection
SYNOPSIS
A supervised attribute filter that can be used to select attributes. It is
very flexible and allows various search and evaluation methods to be
combined.
b. Normalisation
NAME
weka.filters.unsupervised.attribute.Normalize
SYNOPSIS
Normalizes all numeric values in the given dataset (apart from the class
attribute, if set). By default, the resulting values are in [0,1] for the data
used to compute the normalization intervals. But with the scale and
translation parameters one can change that, e.g., with scale = 2.0 and
translation = -1.0 you get values in the range [-1,+1].
c. Outlier detection
NAME
weka.filters.unsupervised.attribute.InterquartileRange
SYNOPSIS
A filter for detecting outliers and extreme values based on interquartile ranges.

d. Discretisation
NAME
weka.filters.unsupervised.attribute.Discretize
SYNOPSIS
An instance filter that discretizes a range of numeric attributes in the
dataset into nominal attributes. Discretization is by simple binning.
Skips the class attribute if set.
e. Handle missing values
NAME
weka.filters.unsupervised.attribute.ReplaceMissingValues
SYNOPSIS
Replaces all missing values for nominal and numeric attributes in a
dataset with the modes and means from the training data. The class
attribute is skipped by default.
Set-6
1. Analyse time series data using Dynamic Time Warping using R-Tool
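(The dtw example in the Timeseries.R listing at the end of this manual, which aligns a sin series against a cos series with dtw() and dtwPlotTwoWay(), covers this task.)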
2. Naive Bayes in WEKA
NAME
weka.classifiers.bayes.NaiveBayes

SYNOPSIS
Class for a Naive Bayes classifier using estimator classes. Numeric
estimator precision values are chosen based on analysis of the training
data. For this reason, the classifier is not an UpdateableClassifier (which
in typical usage are initialized with zero training instances) -- if you need
the UpdateableClassifier functionality, use the NaiveBayesUpdateable
classifier. The NaiveBayesUpdateable classifier will use a default
precision of 0.1 for numeric attributes when buildClassifier is called
with zero training instances.
OPTIONS
useKernelEstimator -- Use a kernel estimator for numeric attributes
rather than a normal distribution.
numDecimalPlaces -- The number of decimal places to be used for the
output of numbers in the model.
batchSize -- The preferred number of instances to process if batch
prediction is being performed. More or fewer instances may be
provided, but this gives implementations a chance to specify a
preferred batch size.
debug -- If set to true, classifier may output additional info to the
console.
displayModelInOldFormat -- Use old format for model output. The old
format is better when there are many class values. The new format is
better when there are fewer classes and many attributes.
doNotCheckCapabilities -- If set, classifier capabilities are not checked
before classifier is built (Use with caution to reduce runtime).
useSupervisedDiscretization -- Use supervised discretization to convert
numeric attributes to nominal ones.

Set-7
1. Perform time series decomposition and forecasting in R
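(The decompose() and arima()/predict() examples on AirPassengers in the Timeseries.R listing at the end of this manual cover this task.)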

2. Adaboost in WEKA
NAME
weka.classifiers.meta.AdaBoostM1
SYNOPSIS
Class for boosting a nominal class classifier using the Adaboost M1
method. Only nominal class problems can be tackled. Often
dramatically improves performance, but sometimes overfits.

Set-8
1. Classify Time series data using R-tool
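(The ctree() examples on the synthetic control data, with and without DWT features, in the Timeseries.R listing at the end of this manual cover this task.)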

2. Bagging in WEKA
weka.classifiers.meta.Bagging
SYNOPSIS
Class for bagging a classifier to reduce variance. Can do classification
and regression depending on the base learner.

Set-9
1. Perform hierarchical clustering on time series data in R
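(The hclust() examples on the synthetic control data, with both Euclidean and DTW distances, in the Timeseries.R listing at the end of this manual cover this task.)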

2. Random Forest in WEKA


NAME
weka.classifiers.trees.RandomForest
SYNOPSIS
Class for constructing a forest of random trees.
Set-10
1. DBSCAN in R
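No listing is given; the sketch below is a minimal example, assuming the fpc package and the iris data (eps and MinPts are illustrative values, not tuned):

# install.packages("fpc")
library(fpc)

data(iris)
x <- as.matrix(iris[, 1:4]) # drop the class column

db <- dbscan(x, eps = 0.42, MinPts = 5) # density-based clustering
db # cluster sizes; cluster 0 holds the noise points
table(db$cluster, iris$Species) # compare found clusters with the species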

2. Generate association rules using Apriori in WEKA


NAME
weka.associations.Apriori
SYNOPSIS
Class implementing an Apriori-type algorithm. Iteratively reduces the
minimum support until it finds the required number of rules with the
given minimum confidence.
The algorithm has an option to mine class association rules. It is
adapted as explained in the second reference.
Set-11
1. Write a program in R-Tool for performing sentiment analysis in
twitter data
# Install required packages:
# 'syuzhet' will be used for sentiment analysis;
# 'tm' and 'SnowballC' are used for text mining and analysis.
install.packages("SnowballC")
install.packages("tm")
install.packages("twitteR")
install.packages("syuzhet")

# Load required packages


library("SnowballC")
library("tm")
library("twitteR")
library("syuzhet")

# Authentication keys (placeholders; use your own app credentials)
consumer_key <- 'ABCDEFGHI1234567890'
consumer_secret <- 'ABCDEFGHI1234567890'
access_token <- 'ABCDEFGHI1234567890'
access_secret <- 'ABCDEFGHI1234567890'
setup_twitter_oauth(consumer_key, consumer_secret, access_token,
access_secret)
tweets <- userTimeline("realDonaldTrump", n=200)
n.tweet <- length(tweets)
tweets.df <- twListToDF(tweets)
head(tweets.df)
# Alternatively, load tweets from a file on disk instead of the live fetch above
tweets.df = read.csv("D:\\two\\Sentiment.csv")
head(tweets.df$text)
dim(tweets.df)
head(tweets.df$text)
tweets.df2 <- gsub("http.*","",tweets.df$text)
tweets.df2 <- gsub("https.*","",tweets.df2)
tweets.df2 <- gsub("#.*","",tweets.df2)
#tweets.df2 <- gsub("#.*","",tweets.df$text)
tweets.df2 <- gsub("@.*","",tweets.df2)
#To match occurrence of any word preceeded by @symbol
#gsub("@[A-Za-z]+","","hai @hello cbit")
#to match occurrence of a single back slash in source
tweets.df2 <- gsub("////","",tweets.df2)
head(tweets.df2)
#Getting sentiment score for each tweet
word.df <- as.vector(tweets.df2)
emotion.df <- get_nrc_sentiment(word.df)
# cbind combines vectors, matrices, or data frames by columns
emotion.df2 <- cbind(tweets.df2, emotion.df)
head(emotion.df2, 100)
# get_sentiment extracts a sentiment score for each tweet
sent.value <- get_sentiment(word.df)
most.positive <- word.df[sent.value == max(sent.value)]
most.positive
most.negative <- word.df[sent.value <= min(sent.value)]
most.negative
sent.value
# segregate positive and negative tweets based on the score assigned to each tweet
positive.tweets <- word.df[sent.value > 0]
head(positive.tweets)
#Negative Tweets
negative.tweets <- word.df[sent.value < 0]
head(negative.tweets)
#Neutral tweets
neutral.tweets <- word.df[sent.value == 0]
head(neutral.tweets)
# Alternate way to classify as Positive, Negative or Neutral tweets
category_senti <- ifelse(sent.value < 0, "Negative", ifelse(sent.value > 0,
"Positive", "Neutral"))
head(category_senti)
table(category_senti)

2. K-Means in WEKA
NAME
weka.clusterers.SimpleKMeans
SYNOPSIS
Cluster data using the k means algorithm. Can use either the Euclidean
distance (default) or the Manhattan distance. If the Manhattan distance
is used, then centroids are computed as the component-wise median
rather than mean.
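Procedure:
1) Open WEKA and load an .arff file.
2) Go to the Cluster tab.
3) Click Choose and select SimpleKMeans.
4) Set numClusters if required and click Start.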

Set-12
1. Write a program in R-tool for displaying word cloud
install.packages('Scale', dependencies=TRUE, repos='http://cran.rstudio.com/')
install.packages('tm', dependencies=TRUE, repos='http://cran.rstudio.com/')
options(repos='http://cran.rstudio.com/')
install.packages("SnowballC")
# Functionality to create pretty word clouds, visualize differences and
# similarity between documents, and avoid over-plotting in scatter plots with text.
install.packages("wordcloud")
# Provides color schemes for maps (and other graphics) designed by Cynthia Brewer; see
# http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know
install.packages("RColorBrewer")
# The stringr package provides a cohesive set of functions
# designed to make working with strings as easy as possible
install.packages("stringr")
# Provides an interface to the Twitter web API.
install.packages("twitteR")
library(Scale)
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
library(stringr)
library(twitteR)

url <- "http://www.rdatamining.com/data/rdmTweets-201306.RData"


download.file(url, destfile = "rdmTweets-201306.RData")

# load tweets into R; the .RData contains a list of tweet objects
load(file = "rdmTweets-201306.RData")

# convert the loaded tweets to a plain text vector
# (assuming the saved object is named "tweets"; adjust if yours differs)
tweets.df <- twListToDF(tweets)
text <- tweets.df$text

#let us create the corpus


# A vector source interprets each element of the vector x as a document: VectorSource(x)
# Corpora are collections of documents containing (natural language) text.
docs <- Corpus(VectorSource(text))

# clean the text data

# content_transformer: creates content transformers, i.e., functions
# which modify the content of an R object
# tm_map: interface to apply transformation functions (also called mappings) to corpora
trans <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, trans, "/")
docs <- tm_map(docs, trans, "@")
docs <- tm_map(docs, trans, "\\|")
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removeWords, c("sudharsan","friendName"))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument)

# create the document term matrix
dtm <- TermDocumentMatrix(docs)
mat <- as.matrix(dtm)
v <- sort(rowSums(mat),decreasing=TRUE)

#Data frame
data <- data.frame(word = names(v),freq=v)
head(data, 10)
#generate the wordcloud
# words : the words to be plotted
# freq : their frequencies
# min.freq : words with frequency below min.freq will not be plotted
# max.words : maximum number of words to be plotted
# random.order : plot words in random order; if FALSE, they are plotted in decreasing frequency
# rot.per : proportion of words with 90-degree rotation (vertical text)
# colors : color words from least to most frequent; use e.g. colors = "black" for a single color
set.seed(1056)
wordcloud(words = data$word, freq = data$freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
2. Generate association rules using Apriori and FP-Growth in WEKA

APRIORI
NAME
weka.associations.Apriori
SYNOPSIS
Class implementing an Apriori-type algorithm. Iteratively reduces the minimum support until it finds the required number of rules with the given minimum confidence. The algorithm has an option to mine class association rules. It is adapted as explained in the second reference.

FP-GROWTH
NAME
weka.associations.FPGrowth
SYNOPSIS
Class implementing the FP-growth algorithm for finding large item sets without candidate generation. Iteratively reduces the minimum support until it finds the required number of rules with the given minimum metric.
Set-13
1. Generate Association rules using Apriori in R-Tool
library(arules)
library(arulesViz)
library(datasets)

data("Groceries")
itemFrequencyPlot(Groceries,topN=20,type="absolute")

rules<-apriori(Groceries,parameter=list(supp=0.001,conf=0.6))
options(digits=2)
inspect(rules[1:5])
plot(rules, method="graph", interactive=TRUE, shading="confidence")

#sorting rules
rules<-sort(rules,by="confidence",decreasing=TRUE)

#using appearance
rules<-apriori(Groceries,parameter=list(supp=0.001,conf=0.6),
appearance=list(default="lhs",rhs="whole milk"),
control=list(verbose=F))
rules<-sort(rules,by="confidence",decreasing=TRUE)
options(digits=2)
inspect(rules[1:5])

#plotting
plot(rules,method="graph",interactive=TRUE,shading=NA)

2. Adaboost in R
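No listing is given; the sketch below is a minimal example, assuming the adabag package (AdaBoost.M1 over rpart trees) and the iris data:

# install.packages("adabag")
library(adabag)

data(iris)
set.seed(42)
ind <- sample(nrow(iris), floor(nrow(iris) * 0.7))
train <- iris[ind, ]
test <- iris[-ind, ]

model <- boosting(Species ~ ., data = train, mfinal = 10) # 10 boosting iterations
pred <- predict(model, newdata = test)
pred$confusion # confusion matrix on the test set
pred$error # test-set error rate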
Set-14
1. Generate Association rules using FP-Growth in R-Tool
library(rCBA)
data("iris")
classifier <- rCBA::buildFPGrowth(iris[sample(nrow(iris),20),], "Species", parallel=FALSE)
model <- classifier$model
predictions <- rCBA::classification(iris, model)
table(predictions)
sum(as.character(iris$Species)==as.character(predictions), na.rm=TRUE)/length(predictions)

2. K-NN in R
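The listing that follows actually runs k-means on the Wholesale customers data; for K-NN itself, a minimal sketch, assuming the class package and the iris data:

library(class)

data(iris)
set.seed(42)
ind <- sample(nrow(iris), floor(nrow(iris) * 0.7))
train_x <- iris[ind, 1:4]; train_y <- iris$Species[ind]
test_x <- iris[-ind, 1:4]; test_y <- iris$Species[-ind]

pred <- knn(train = train_x, test = test_x, cl = train_y, k = 5)
table(pred, test_y) # confusion matrix
mean(pred == test_y) # accuracy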
data <-read.csv("Wholesale customers data.csv",header=T)
summary(data)
top.n.custs <- function(data, cols, n=5) { # requires a data frame and the top N to remove
  idx.to.remove <- integer(0) # initialize a vector to hold customers being removed
  for (c in cols) { # for every column passed to this function
    col.order <- order(data[,c], decreasing=T) # sort column "c" in descending order (bigger on top)
    # order() returns the sorted index (e.g. row 15, 3, 7, 1, ...) rather than the actual sorted values
    idx <- head(col.order, n) # take the first n of the sorted column c
    idx.to.remove <- union(idx.to.remove, idx) # combine and de-duplicate the row ids to be removed
  }
  return(idx.to.remove) # return the indexes of customers to be removed
}
top.custs <- top.n.custs(data, cols=3:8, n=5)
length(top.custs) # how many customers will be removed?
data[top.custs,] # examine those customers
data.rm.top <- data[-c(top.custs),] # remove them
set.seed(76964057) # set the seed for reproducibility
k <- kmeans(data.rm.top[,-c(1,2)], centers=5) # create 5 clusters; drop columns 1 and 2
k$centers # display cluster centers
table(k$cluster) # count of data points in each cluster
rng <- 2:20 # k from 2 to 20
tries <- 100 # run k-means 100 times per k
avg.totw.ss <- integer(length(rng)) # empty vector to hold the average total within-SS for each k
for (v in rng) { # for each value of k
  v.totw.ss <- integer(tries) # empty vector to hold the 100 tries
  for (i in 1:tries) {
    k.temp <- kmeans(data.rm.top, centers=v) # run kmeans
    v.totw.ss[i] <- k.temp$tot.withinss # store the total withinss
  }
  avg.totw.ss[v-1] <- mean(v.totw.ss) # average the 100 total withinss values
}
plot(rng, avg.totw.ss, type="b", main="Total Within SS by Various K",
     ylab="Average Total Within Sum of Squares",
     xlab="Value of K")

Set-15
1. Build a Decision Tree Classifier in R-Tool using the package party.
library(readr)
library(dplyr)
library(party)
library(rpart)
library(rpart.plot)
library(ROCR)
library(magrittr)

titanic3<-"https://goo.gl/At238b"%>%
read.csv%>%
select(survived,embarked,sex,sibsp,parch,fare)%>%
mutate(embarked=factor(embarked),sex=factor(sex))
#splitting data
.data<-c("training","test")%>%
sample(nrow(titanic3),replace=TRUE)%>%
split(titanic3,.)

#implementation using rpart


rtree_fit<-rpart(survived~.,.data$training)
rpart.plot(rtree_fit)

#implementation using ctree


tree_fit<-ctree(survived~.,data=.data$training)
tree_roc<-tree_fit%>%
predict(newdata=.data$test)%>%
prediction(.data$test$survived)%>%
performance("tpr","fpr")
tree_roc
2. Hierarchical Clustering in R
Agnes.R
library(tidyverse) # data manipulation
library(cluster) # clustering algorithms
library(factoextra) # clustering visualization
library(dendextend) # for comparing two dendrograms
df <- USArrests
df <- na.omit(df)
df <- scale(df)
head(df)

# Dissimilarity matrix
d <- dist(df, method = "euclidean")

# Hierarchical clustering using complete linkage
hc1 <- hclust(d, method = "complete")

# Plot the obtained dendrogram
plot(hc1, cex = 0.6, hang = -1, main = "Dendrogram of Agnes")

Diana.R
library(tidyverse) # data manipulation
library(cluster) # clustering algorithms
library(factoextra) # clustering visualization
library(dendextend) # for comparing two dendrograms
df <- USArrests
df
df <- na.omit(df)
df <- scale(df)
head(df)
# Dissimilarity matrix
d <- dist(df, method = "euclidean")
# compute divisive hierarchical clustering
hc4 <- diana(df)

# Divisive coefficient; amount of clustering structure found
hc4$dc
## [1] 0.8514345

# plot dendrogram
pltree(hc4, cex = 0.6, hang = -1, main = "Dendrogram of diana")

Set-16
1. Build a Decision Tree Classifier in R-Tool using the package caret.
infogain.R
library(caret)
library(rpart.plot)
data_url <- c("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data")
download.file(url = data_url, destfile = "car.data")
car_df <- read.csv("car.data", sep = ',', header = FALSE)
str(car_df)
head(car_df)
set.seed(3033)
intrain <- createDataPartition(y = car_df$V7, p= 0.7, list = FALSE)
training <- car_df[intrain,]
testing <- car_df[-intrain,]
#check dimensions of train & test set
dim(training); dim(testing);
anyNA(car_df)
summary(car_df)
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(3333)
dtree_fit <- train(V7 ~., data = training, method = "rpart",
parms = list(split = "information"),
trControl=trctrl,
tuneLength = 10)

dtree_fit

prp(dtree_fit$finalModel, box.palette = "Reds", tweak = 1.2)


test_pred <- predict(dtree_fit, newdata = testing)
confusionMatrix(test_pred, testing$V7 )

gini.R
library(caret)
library(rpart.plot)
data_url <- c("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data")
download.file(url = data_url, destfile = "car.data")
car_df <- read.csv("car.data", sep = ',', header = FALSE)
str(car_df)
head(car_df)
set.seed(3033)
intrain <- createDataPartition(y = car_df$V7, p= 0.7, list = FALSE)
training <- car_df[intrain,]
testing <- car_df[-intrain,]
#check dimensions of train & test set
dim(training); dim(testing);
anyNA(car_df)
summary(car_df)
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(3333)
dtree_fit_gini <- train(V7 ~., data = training, method = "rpart",
parms = list(split = "gini"),
trControl=trctrl,
tuneLength = 10)
dtree_fit_gini
prp(dtree_fit_gini$finalModel, box.palette = "Reds", tweak = 1.2)
test_pred_gini <- predict(dtree_fit_gini, newdata = testing)
confusionMatrix(test_pred_gini, testing$V7 ) #check accuracy

2. Naive Bayes in R
#install.packages("caret")
library(caTools)
library(caret)
library(e1071)
## Load vcd package
library(vcd)

## Load Arthritis dataset (data frame)


data(Arthritis)
head(Arthritis)

df<-data.frame(Arthritis)
nrow(Arthritis)
ind<-sample(nrow(Arthritis),floor(nrow(Arthritis)*0.3))
ind
train<-df[ind,]
train
train$Improved
test<-df[-ind,]
x_train <- train[,-5] # predictors only: drop the class column "Improved" (column 5)
x_train
y_train <- train$Improved
y_train
x_test <- test[,-5]
x_test
y_test<-test$Improved
y_test

classifier=naiveBayes(x_train,y_train,laplace=0)
classifier
predictions<-predict(classifier,x_test)
conf<-confusionMatrix(predictions,y_test)
print(conf)

Set-17
1. Build a Decision Tree Classifier in R-Tool using the package rpart.
library(readr)
library(dplyr)
library(party)
library(rpart)
library(rpart.plot)
library(ROCR)
library(magrittr)

titanic3<-"https://goo.gl/At238b"%>%
read.csv%>%
select(survived,embarked,sex,sibsp,parch,fare)%>%
mutate(embarked=factor(embarked),sex=factor(sex))

#splitting data
.data<-c("training","test")%>%
sample(nrow(titanic3),replace=TRUE)%>%
split(titanic3,.)

#implementation using rpart


rtree_fit<-rpart(survived~.,.data$training)
rpart.plot(rtree_fit)
#implementation using ctree
tree_fit<-ctree(survived~.,data=.data$training)
tree_roc<-tree_fit%>%
predict(newdata=.data$test)%>%
prediction(.data$test$survived)%>%
performance("tpr","fpr")
tree_roc

2. K-Means in R
## ----setup, include = FALSE----------------------------------------------
knitr::opts_chunk$set(message = FALSE, warning = FALSE)

library(dplyr)
library(ggplot2)
library(purrr)
library(tibble)
library(tidyr)

set.seed(27)

centers <- tibble(
  cluster = factor(1:3),
  num_points = c(100, 150, 50), # number of points in each cluster
  x1 = c(5, 0, -3), # x1 coordinate of cluster center
  x2 = c(-1, 1, -2) # x2 coordinate of cluster center
)

labelled_points <- centers %>%
  mutate(
    x1 = map2(num_points, x1, rnorm),
    x2 = map2(num_points, x2, rnorm)
  ) %>%
  select(-num_points) %>%
  unnest(x1, x2)

ggplot(labelled_points, aes(x1, x2, color = cluster)) +
  geom_point()

points <- labelled_points %>%
  select(-cluster)
kclust <- kmeans(points, centers = 3)
kclust
summary(kclust)
library(broom)
augment(kclust, points)
tidy(kclust)
glance(kclust)
kclusts <- tibble(k = 1:9) %>%
mutate(
kclust = map(k, ~kmeans(points, .x)),
tidied = map(kclust, tidy),
glanced = map(kclust, glance),
augmented = map(kclust, augment, points)
)
kclusts
clusters <- kclusts %>%
unnest(tidied)
assignments <- kclusts %>%
unnest(augmented)
clusterings <- kclusts %>%
unnest(glanced, .drop = TRUE)
p1 <- ggplot(assignments, aes(x1, x2)) +
geom_point(aes(color = .cluster)) +
facet_wrap(~ k)
p1
p2 <- p1 + geom_point(data = clusters, size = 10, shape = "x")
p2
ggplot(clusterings, aes(k, tot.withinss)) +
geom_line()
Set-18
1. Generate Association rules using ECLAT in R-Tool
library(arules)
library(arulesViz)
library(datasets)

data("Groceries")
itemFrequencyPlot(Groceries,topN=20,type="absolute")
itemsets<-eclat(Groceries,parameter=list(supp=0.001,maxlen=3))

rules<-ruleInduction(itemsets,Groceries,confidence=0.6)
inspect(rules[1:5])
plot(rules, method="graph", interactive=TRUE, shading=NA)

2. Bagging in R
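No listing is given; the sketch below is a minimal example, assuming the adabag package and the iris data:

# install.packages("adabag")
library(adabag)

data(iris)
set.seed(42)
ind <- sample(nrow(iris), floor(nrow(iris) * 0.7))
train <- iris[ind, ]
test <- iris[-ind, ]

model <- bagging(Species ~ ., data = train, mfinal = 10) # 10 bootstrap replicates of rpart trees
pred <- predict(model, newdata = test)
pred$confusion # confusion matrix on the test set
pred$error # test-set error rate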
Set-19
1. Hierarchical Clustering in WEKA
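Procedure (sketch; option names from the WEKA Explorer):
1) Open WEKA and load an .arff file.
2) Go to the Cluster tab.
3) Click Choose and select HierarchicalClusterer.
4) Pick the linkType (e.g. SINGLE, COMPLETE, AVERAGE) and the number of clusters, then click Start.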
2. Data visualisation and exploration in R
a. Read the dataset into R-Dataframe
b. Get the first 5 rows
c. Correlation of two dimensions
d. Histogram of an attribute
e. Cleveland Dot Charts
f. Bar Charts
g. Pie chart
h. Line charts for both numeric and categorical dimensions
(Note your observations, comment on the data distribution, try plotting commands for different kinds of dimensions, and try different plotting function options: symbols, size of plotting symbol, legends, x/y-axis labels, titles of graphs, etc.)
Set-20
1. DBSCAN in WEKA
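Procedure (sketch; in recent WEKA versions DBSCAN is not bundled and is installed first via Tools -> Package Manager, where it ships in the optics_dbScan package):
1) Open WEKA and load an .arff file.
2) Go to the Cluster tab.
3) Click Choose and select DBSCAN.
4) Set epsilon and minPoints, then click Start.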

2. Data visualisation and exploration in R


a. Read the dataset into R-Dataframe
b. Get second attributes of the first 10 rows
c. Covariance of two attributes
d. Pair Plot
e. Box-Whisker Plots
f. Frequency of each class type
g. Density
h. Line charts for both numeric and categorical dimensions
(Note your observations, comment on the data distribution, try plotting commands for different kinds of dimensions, and try different plotting function options: symbols, size of plotting symbol, legends, x/y-axis labels, titles of graphs, etc.)

Set-21
1. Convert .csv file into .arff file using WEKA
Procedure:
1) Open the WEKA GUI Chooser.
2) Click on Tools.
3) Click on ArffViewer.
4) Click File and open a .csv file by changing the file type.
5) Click File and save it as a .arff file.

2. Data visualisation and exploration in R


a. Read the dataset into R-Dataframe
b. Check the dimensionality of the chosen dataset
c. Display Variable names or column names
d. Check the Structure of the object
e. Get the last 6 rows
f. Distribution of every dimension
g. Variance of a numeric attribute
h. Scatter plot
(Note your observations, comment on the data distribution, try plotting commands for different kinds of dimensions, and try different plotting function options: symbols, size of plotting symbol, legends, x/y-axis labels, titles of graphs, etc.)

Timeseries.R (reference listing for the time-series tasks in Sets 4, 6, 7, 8 and 9)
a <- ts(1:20, frequency = 12, start = c(2011, 3))
print(a)
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 2011           1   2   3   4   5   6   7   8   9  10
## 2012  11  12  13  14  15  16  17  18  19  20
str(a)
## Time-Series [1:20] from 2011 to 2013: 1 2 3 4 5 6 7 8 9 10...
attributes(a)
## $tsp
## [1] 2011 2013 12
##
## $class
## [1] "ts"
apts <- ts(AirPassengers, frequency = 12)
f <- decompose(apts)
plot(f$figure, type = "b")
plot(f)
# build an ARIMA model
fit <- arima(AirPassengers, order = c(1, 0, 0), list(order = c(2,
1, 0), period = 12))
fore <- predict(fit, n.ahead = 24)
# error bounds at 95% confidence level
U <- fore$pred + 2 * fore$se
L <- fore$pred - 2 * fore$se
ts.plot(AirPassengers, fore$pred, U, L,
col = c(1, 2, 4, 4), lty = c(1, 1, 2, 2))
legend("topleft", col = c(1, 2, 4), lty = c(1, 1, 2),
c("Actual", "Forecast", "Error Bounds (95% Confidence)"))
library(dtw)
idx <- seq(0, 2 * pi, len = 100)
a <- sin(idx) + runif(100)/10
b <- cos(idx)
align <- dtw(a, b, step = asymmetricP1, keep = T)
dtwPlotTwoWay(align)
# read data into R; sep = "" means the separator is white space,
# i.e., one or more spaces, tabs, newlines or carriage returns
sc <- read.table("synthetic_control.txt", header = F)
sc
# show one sample from each class
idx <- c(1, 101, 201, 301, 401, 501)
sample1 <- t(sc[idx, ])
plot.ts(sample1, main = "")
# sample n cases from every class
n <- 10
s <- sample(1:100, n)
idx <- c(s, 100 + s, 200 + s, 300 + s, 400 + s, 500 + s)
sample2 <- sc[idx, ]
observedLabels <- rep(1:6, each = n)
# hierarchical clustering with Euclidean distance
hc <- hclust(dist(sample2), method = "ave")
plot(hc, labels = observedLabels, main = "")
# cut tree to get 8 clusters
memb <- cutree(hc, k = 8)
table(observedLabels, memb)
## memb
## observedLabels 1 2 3 4 5 6 7 8
## 1 10 0 0 0 0 0 0 0
## 2 0 3 1 1 3 2 0 0
## 3 0 0 0 0 0 0 10 0
## 4 0 0 0 0 0 0 0 10
## 5 0 0 0 0 0 0 10 0
## 6 0 0 0 0 0 0 0 10
myDist <- dist(sample2, method = "DTW")
hc <- hclust(myDist, method = "average")
plot(hc, labels = observedLabels, main = "")
# cut tree to get 8 clusters
memb <- cutree(hc, k = 8)
table(observedLabels, memb)

classId <- rep(as.character(1:6), each = 100)
newSc <- data.frame(cbind(classId, sc))
library(party)
ct <- ctree(classId ~ ., data = newSc,
            controls = ctree_control(minsplit = 20, minbucket = 5, maxdepth = 5))
pClassId <- predict(ct)
table(classId, pClassId)
## pClassId
## classId 1 2 3 4 5 6
## 1 100 0 0 0 0 0
## 2 1 97 2 0 0 0
## 3 0 0 99 0 1 0
## 4 0 0 0 100 0 0
## 5 4 0 8 0 88 0
## 6 0 3 0 90 0 7
# accuracy
(sum(classId == pClassId))/nrow(sc)
## [1] 0.8183
# extract DWT (with Haar filter) coefficients
library(wavelets)
wtData <- NULL
for (i in 1:nrow(sc)) {
a <- t(sc[i, ])
wt <- dwt(a, filter = "haar", boundary = "periodic")
wtData <- rbind(wtData, unlist(c(wt@W, wt@V[[wt@level]])))
}
wtData <- as.data.frame(wtData)
wtSc <- data.frame(cbind(classId, wtData))
ct <- ctree(classId ~ ., data = wtSc,
            controls = ctree_control(minsplit = 20, minbucket = 5, maxdepth = 5))
pClassId <- predict(ct)
table(classId, pClassId)
## pClassId
## classId 1 2 3 4 5 6
## 1 98 2 0 0 0 0
## 2 1 99 0 0 0 0
## 3 0 0 81 0 19 0
## 4 0 0 0 74 0 26
## 5 0 0 16 0 84 0
## 6 0 0 0 3 0 97
(sum(classId==pClassId)) / nrow(wtSc)
## [1] 0.8883
plot(ct, ip_args = list(pval = F), ep_args = list(digits = 0))
k <- 20
newTS <- sc[501, ] + runif(100) * 15
distances <- dist(newTS, sc, method = "DTW")
s <- sort(as.vector(distances), index.return = TRUE)
# class IDs of k nearest neighbours
table(classId[s$ix[1:k]])
##
## 4 6
## 3 17
getwd()
