
Data Mining

Practical Questions

Submitted by: Pushpak Dhami


University Roll No. - 16013570036
Course: B.Sc. (H) Computer Science (Semester – VI)
College: College of Vocational Studies, Delhi University

Q1. Create a file “people.txt” with the following data:

(i) Read the data from the file “people.txt”.

(ii) Create a rule set E that contains rules to check for the following conditions:

1. The age should be in the range 1-150.

2. The age should be greater than the number of years married.

3. The status should be married, single, or widowed.

4. If age is less than 18, the age group should be child; if age is between 18 and 65, the age group should be adult; if age is more than 65, the age group should be elderly.

(iii) Check whether rule set E is violated by the data in the file people.txt.

(iv) Summarize the results obtained in part (iii)

(v) Visualize the results obtained in part (iii)

people.txt

age agegroup height status yearsmarried
21 adult 6 single -1
2 child 3 married 0
18 adult 5.7 married 20
221 elderly 5 widowed 2
34 child -7 married 3

people.r

install.packages("editrules")
library(editrules)

# (i) Read the data from people.txt
d <- read.table("people.txt", header = TRUE)

# (ii) Rule set E
E <- editset(expression(
  age >= 0,
  age <= 150,
  age > yearsmarried,
  status %in% c('single', 'married', 'widowed'),
  if (age <= 18) agegroup %in% c('child'),
  if (age > 18 && age < 65) agegroup %in% c('adult'),
  if (age >= 65) agegroup %in% c('elderly')
))

# (iii) Check which records violate which rules
sm <- violatedEdits(E, d)

# (iv) Summarize and (v) visualize the violations
summary(sm)
plot(sm)
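To see which of the five records break at least one rule (a small hedged addition, not part of the submitted script; it assumes sm is the violatedEdits result created above), the logical violation matrix can be row-summed:

bad <- rowSums(as.data.frame(sm), na.rm = TRUE) > 0   # TRUE for records violating any rule
d[bad, ]                                              # show only the offending records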

OUTPUT:

View(E) Ruleset

Violations Summary

Plotting Violation Summary


Program 2: Perform the following preprocessing tasks on the dirty_iris dataset.

i) Calculate the number and percentage of observations that are complete.

ii) Replace all the special values in data with NA.

iii) Define these rules in a separate text file and read them (a sample rules file is sketched after this list).

(Use editfile function in R (package editrules). Use similar function in Python).

Print the resulting constraint object.

– Species should be one of the following values: setosa, versicolor or virginica.

– All measured numerical properties of an iris should be positive.

– The petal length of an iris is at least 2 times its petal width.

– The sepal length of an iris cannot exceed 30cm.

– The sepals of an iris are longer than its petals.

iv) Determine how often each rule is broken (violatedEdits). Also summarize and plot the result.
v) Find outliers in sepal length using boxplot and boxplot.stats
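A possible IrisRules.txt encoding the five constraints above (a hedged sketch; it assumes the column names Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species used in the script below) could contain:

Species %in% c('setosa', 'versicolor', 'virginica')
Sepal.Length > 0
Sepal.Width > 0
Petal.Length > 0
Petal.Width > 0
Petal.Length >= 2 * Petal.Width
Sepal.Length <= 30
Sepal.Length > Petal.Length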

library(dplyr)
library(editrules)

dirtyIris = read.csv(file = 'dirty_iris.csv', sep = ',', header = TRUE)

# (i) Count the observations (rows) that contain at least one NA
countt = 0
for(i in 1:nrow(dirtyIris)){
  for(j in 1:ncol(dirtyIris)){
    if(is.na(dirtyIris[i,j])){
      countt = countt + 1
      break
    }
  }
}
print('Number of complete observations: ')
print(nrow(dirtyIris) - countt)
percentage = countt * 100 / nrow(dirtyIris)
print('Percentage of complete observations: ')
print(100 - percentage)

# (ii) Replace special values (Inf, -Inf; NaN already registers as NA) and
# negative measurements in the four numeric columns with NA
for(i in 1:nrow(dirtyIris)){
  for(j in 1:4){
    if(!is.na(dirtyIris[i,j])){
      if(is.infinite(dirtyIris[i,j]) | dirtyIris[i,j] < 0){
        dirtyIris[i,j] = NA
      }
    }
  }
}

View(dirtyIris)

# (iii) Read the rules from the separate text file and print the constraint object
rules <- editfile(file = 'IrisRules.txt', type = 'all')
print(rules)

# (iv) How often is each rule broken? Summarize and plot the result
violations = violatedEdits(rules, dirtyIris, datamodel = TRUE)

View(summary(violations))
plot(violations)

# (v) Outliers in sepal length via boxplot and boxplot.stats
boxplot(dirtyIris$Sepal.Length, main = 'Sepal.Length')
stat = boxplot.stats(dirtyIris$Sepal.Length)
View(stat$out)
View(dirtyIris[!is.na(dirtyIris$Sepal.Length) & dirtyIris$Sepal.Length %in% stat$out, ])
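As a cross-check for part (i), base R's complete.cases gives the same number and percentage in two lines (a hedged alternative, not part of the submitted script):

sum(complete.cases(dirtyIris))                          # number of complete observations
100 * sum(complete.cases(dirtyIris)) / nrow(dirtyIris)  # percentage of complete observations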

OUTPUT:

Number and percentage of complete observations


Special values replaced with NA

IRIS data set rules in separate file

Read command for IRIS data set rules:


Summarization and plot of violations:

Outliers in sepal length:


Program 3: Load the data from the wine dataset. Check whether all attributes are standardized or not (mean is 0 and standard deviation is 1). If not, standardize the attributes. Do the same with the Iris dataset.
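Standardization rescales each numeric attribute x to (x - mean(x)) / sd(x), so that the transformed column has mean 0 and standard deviation 1. The script below does this with BBmisc::normalize; an equivalent base-R sketch (a hedged alternative, not the submitted code) uses scale():

iris_std <- iris                      # copy so the built-in iris stays untouched
iris_std[, 1:4] <- scale(iris[, 1:4]) # standardize the four numeric columns
colMeans(iris_std[, 1:4])             # all approximately 0
apply(iris_std[, 1:4], 2, sd)         # all equal to 1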

library(BBmisc)

wine = read.csv(file='wine.csv', header = TRUE)

# Check the mean of every numeric attribute of wine; standardize any column
# whose mean is not already 0
colMeanWine = c()

for(col in names(wine)){
  if(class(wine[,col]) == 'numeric' || class(wine[,col]) == 'integer'){
    colMeanWine[col] = colMeans(wine[col], na.rm = TRUE)
    if(colMeanWine[col] != 0){
      wine[col] = normalize(wine[col], method = "standardize")
    }
  }
}

# Recompute the column means after standardization (they should now all be ~0)
for(col in names(wine)){
  if(class(wine[,col]) == 'numeric' || class(wine[,col]) == 'integer'){
    colMeanWine[col] = colMeans(wine[col], na.rm = TRUE)
  }
}

View(wine)
View(colMeanWine)

# Repeat the same check and standardization for the iris dataset
colMeanIris = c()

for(col in names(iris)){
  if(class(iris[,col]) == 'numeric' || class(iris[,col]) == 'integer'){
    colMeanIris[col] = colMeans(iris[col], na.rm = TRUE)
    if(colMeanIris[col] != 0){
      iris[col] = normalize(iris[col], method = "standardize")
    }
  }
}

for(col in names(iris)){
  if(class(iris[,col]) == 'numeric' || class(iris[,col]) == 'integer'){
    colMeanIris[col] = colMeans(iris[col], na.rm = TRUE)
  }
}

View(iris)
View(colMeanIris)
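The loops above verify only the column means; to confirm the second half of the criterion (standard deviation equal to 1), a short hedged check such as the following can be appended:

apply(wine[sapply(wine, is.numeric)], 2, sd, na.rm = TRUE)  # should all be ~1 after standardization
apply(iris[, 1:4], 2, sd)                                   # likewise for the standardized iris columns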

OUTPUT:

Column means of the WINE and IRIS datasets

Run the following algorithms on 2 real datasets and use appropriate evaluation measures to compute the correctness of the obtained patterns:

Program 4: Run the Apriori algorithm to find frequent itemsets and association rules.

4.1 Use minimum support as 50% and minimum confidence as 75%

4.2 Use minimum support as 60% and minimum confidence as 60%

Program4.R

library(arules)
data(Adult)   # built-in transaction data set from the arules package

# 4.1: minimum support 50%, minimum confidence 75%
rules <- apriori(Adult, parameter = list(supp = 0.5, conf = 0.75, target = "rules"))
summary(rules)
inspect(head(rules))
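The script shows only the first parameter setting; for 4.2 the same call is repeated with both thresholds lowered to 60% (a minimal sketch reusing the Adult transactions loaded above), and frequent itemsets alone can be requested with target = "frequent itemsets":

# 4.2: minimum support 60%, minimum confidence 60%
rules2 <- apriori(Adult, parameter = list(supp = 0.6, conf = 0.6, target = "rules"))
summary(rules2)
inspect(head(rules2))

# frequent itemsets only, at 60% support
itemsets <- apriori(Adult, parameter = list(supp = 0.6, target = "frequent itemsets"))
inspect(head(itemsets))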

OUTPUT:

Summary (support = 50%, confidence = 75%):

Inspection of rules:

Summary (support = 60%, confidence = 60%):


Inspection of rules

Program 5: Use Naive Bayes, K-nearest neighbour, and Decision tree classification algorithms to build classifiers. Divide the data set into training and test sets. Compare the accuracy of the different classifiers under the following situations:

5.1 a) Training set = 75%, Test set = 25%

5.1 b) Training set = 66.6% (2/3 of total), Test set = 33.3% (a sketch for this split appears after the script)

5.2 Training set is chosen by

i) hold out method

ii) Random subsampling


iii) Cross-validation. Compare the accuracy of the classifiers obtained.

5.3 Data is scaled to standard format (also covered by the sketch after the script).

Program5.R

library(rpart) #for decision trees


library(caret) #Confusion Matrix, train
library(e1071) #for Naive Bayes
library(class) #KNN
data(iris) #for loading dataset

#Holdout method: first 75% of the rows form the training set, the remaining 25% the test set
smp_size <- floor(0.75 * nrow(iris))
train <- iris[1:smp_size, ]
test <- iris[-(1:smp_size), ]

#Naive Bayes
model <- naiveBayes(Species ~ ., data = train)
prediction <- predict(model, test)
View(confusionMatrix(prediction, test[,5])$table, 'NaiveBayes using Holdout')

#Decision Tree
model <- rpart(Species ~ ., data = train)
prediction <- predict(model, test, type = "class")
View(confusionMatrix(prediction, test[,5])$table, 'Decision Tree using Holdout')

#KNN
prediction = knn(train[,-5], test[,-5], factor(train[,5]), k = 10)
View(confusionMatrix(prediction, test[,5])$table, 'KNN using Holdout')

#Random Subsampling
smp_size <- floor(0.75 * nrow(iris))
set.seed(123)
train_ind <- sample(nrow(iris), size = smp_size)
train <- iris[train_ind, ]
test <- iris[-train_ind, ]

#Naive Bayes
model <- naiveBayes(Species ~ ., data = train)
prediction <- predict(model, test)
View(confusionMatrix(prediction, test[,5])$table, 'NaiveBayes using Random Sampling')

#Decision Tree
model <- rpart(Species ~ ., data = train)
prediction <- predict(model, test, type = "class")
View(confusionMatrix(prediction, test[,5])$table, 'Decision Tree using Random Sampling')

#KNN
prediction = knn(train[,-5], test[,-5], factor(train[,5]), k = 10)
View(confusionMatrix(prediction, test[,5])$table, 'KNN using Random Sampling')
#K-Fold Cross Validation Naive Bayes
train_control <- trainControl(method="cv", number=10)
model <- train(Species~., data=iris, trControl=train_control, method="nb")
prediction <- predict(model, test)
View(confusionMatrix(prediction, test[,5])$table, 'NaiveBayes using KFOLD')

#K-Fold Cross Validation Decision Tree


train_control <- trainControl(method="cv", number=10)
model <- train(Species~., data=iris, trControl=train_control, method="rpart")
prediction <- predict(model, test)
View(confusionMatrix(prediction, test[,5])$table, 'Decision Tree using KFOLD')

#K-Fold Cross Validation KNN


train_control <- trainControl(method="cv", number=10)
model <- train(Species~., data=iris, trControl=train_control, method="knn")
prediction <- predict(model, test)
View(confusionMatrix(prediction, test[,5])$table, 'KNN using KFOLD')
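The script above covers the 75/25 holdout, random subsampling, and 10-fold cross-validation; the remaining cases, the 2/3-1/3 split of 5.1 b) and the standardized data of 5.3, together with an explicit accuracy comparison, can be handled with a hedged sketch along these lines (it reuses the packages loaded above and pulls Accuracy out of confusionMatrix):

# 5.1 b) training set = 2/3 of the data, test set = 1/3 (random split)
set.seed(123)
idx <- sample(nrow(iris), size = floor(2/3 * nrow(iris)))
train <- iris[idx, ]
test <- iris[-idx, ]

# 5.3 scale the predictors to standard format using the training statistics
ctr <- colMeans(train[, 1:4])
sds <- apply(train[, 1:4], 2, sd)
train_s <- train; train_s[, 1:4] <- scale(train[, 1:4], center = ctr, scale = sds)
test_s <- test;   test_s[, 1:4]  <- scale(test[, 1:4], center = ctr, scale = sds)

# accuracy of each classifier on the 2/3-1/3 split (scaled data used for KNN)
acc <- c(
  NaiveBayes = confusionMatrix(predict(naiveBayes(Species ~ ., data = train), test),
                               test$Species)$overall['Accuracy'],
  DecisionTree = confusionMatrix(predict(rpart(Species ~ ., data = train), test, type = "class"),
                                 test$Species)$overall['Accuracy'],
  KNN = confusionMatrix(knn(train_s[, -5], test_s[, -5], train_s$Species, k = 10),
                        test_s$Species)$overall['Accuracy']
)
print(acc)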

OUTPUT:

Decision Tree Using HOLDOUT:

KNN Using Random Sampling:


NaiveBayes Using Random Sampling:

Decision Tree Using Random Sampling:

KNN Using HOLDOUT:

NaiveBayes Using KFOLD:


Decision Tree Using KFOLD:

KNN using KFOLD:


Program 6: Use Simple K-means, DBSCAN, and Hierarchical clustering algorithms for clustering. Compare the performance of the clusters by changing the parameters involved in the algorithms.

program6.R

library(dbscan)
#kmeans
cl <- kmeans(iris[,-5], 3)
plot(iris[,-5], col = cl$cluster)
points(cl$centers, col = 1:3, pch = 8)

#hierarchical
clusters <- hclust(dist(iris[, -5]))
plot(clusters)

#DBScan
cl <- dbscan(iris[,-5], eps = .5, minPts = 5)
plot(iris[,-5], col = cl$cluster + 1L)   # +1 so that noise points (cluster 0) remain visible
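To compare clusterings as the parameters change (the second half of the task), a minimal hedged sketch varies k for k-means, eps for DBSCAN, and the number of groups cut from the dendrogram:

# k-means with different numbers of clusters
for(k in 2:5){
  cl <- kmeans(iris[,-5], centers = k, nstart = 10)
  cat("k =", k, " total within-cluster SS =", cl$tot.withinss, "\n")
}

# DBSCAN with different neighbourhood radii
for(e in c(0.3, 0.5, 0.8)){
  cl <- dbscan(iris[,-5], eps = e, minPts = 5)
  cat("eps =", e, " clusters found =", max(cl$cluster), "\n")
}

# hierarchical clustering cut into 3 groups, compared against the species labels
clusters <- hclust(dist(iris[,-5]))
table(cutree(clusters, k = 3), iris$Species)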

OUTPUT:
K-Means Cluster

Hierarchical Cluster

DBSCAN Clustering
