
Question Bank

Course & Branch: B.E. (CSE) & IT                         Semester: 6th

Subject: Data Warehouse & Data Mining                    Subject Code: CST-354

No. of Students: 1000                                    Regular/Reappear: Regular

Short Answer Type Questions

Format: Sr. No. [BT Level, CO, Unit / Topic] Question
1. [L1, CO1, Unit 1 / Introduction] Why do many enterprises need a data warehouse? Give two major components of any data warehouse system.

2. [L1, CO1, Unit 1 / ETL] Explain why ETL must deal with dirty data when extracting information from the source systems.

3. [L1, CO1, Unit 1 / ETL] What is ETL? List the major steps involved in the ETL process.

4. [L2, CO1, Unit 1 / Multidimensional view and data cube] Explain the multidimensional view and the various data cube operations.

5. [L2, CO1, Unit 1 / OLAP] Differentiate between OLAP and OLTP.

6. [L2, CO2, Unit 1 / Star and snowflake schema] Explain the difference between the star schema and the snowflake schema.

7. [L1, CO2, Unit 1 / Fact constellation] List the various advantages and disadvantages of the fact constellation schema.

8. [L2, CO2, Unit 1 / OLAP operations in the multidimensional data model] Enlist the various OLAP operations in the multidimensional data model.

9. [L3, CO2, Unit 1 / Star schema] Elaborate the concept of the Starnet query model for querying multidimensional databases.

10. [L3, CO1, Unit 1 / Data warehouse] List some of the design guidelines for data warehouse implementation.
11. [L3, CO2, Unit 2 / Apriori algorithm] What is the Apriori algorithm? Name some variants of the Apriori algorithm.

12. [L2, CO2, Unit 2 / Naïve algorithm] Explain Naïve Bayes classification.

13. [L3, CO2, Unit 2 / Direct hashing and pruning] What are the various factors that affect the complexity of direct hashing and pruning?

14. [L2, CO2, Unit 2 / Dynamic itemset counting] Explain the various operations of Dynamic Itemset Counting (DIC).

15. [L2, CO2, Unit 2 / Frequent patterns without candidate generation] Elaborate the concept of FP-growth with the help of an example.

16. [L2, CO2, Unit 2 / Mining] What are the various association rules of data mining? Explain with the help of an example.

17. [L1, CO1, Unit 2 / Need for data preprocessing] Why is there a need for data preprocessing?

18. [L1, CO3, Unit 2 / Data cleaning] Describe the different data cleaning approaches.

19. [L2, CO3, Unit 2 / Data integration & transformation, data reduction, data discretization & concept hierarchy generation] Explain the following terms:
    i. Data integration and transformation
    ii. Data reduction
    iii. Data discretization and concept hierarchy generation

20. [L3, CO3, Unit 2 / Performance evaluation of algorithms] How is the evaluation of various algorithms assessed?
21. [L1, CO3, Unit 3 / Introduction to web data mining] Enlist the classification techniques of data mining.
Ans: Classification is carried out in two steps:
- Building the classifier or model
- Using the classifier for classification

22. [L2, CO3, Unit 3 / Decision tree] What is a decision tree? How does a decision tree work?
Ans: A decision tree is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems. It works for both categorical and continuous input and output variables. In this technique, we split the population or sample into two or more homogeneous sets (or sub-populations) based on the most significant splitter/differentiator among the input variables.

23. [L2, CO3, Unit 3 / Tree induction algorithm] Elaborate an example of the decision tree induction algorithm.
Ans: Suppose we have a sample of 30 students with three variables: Gender (Boy/Girl), Class (IX/X) and Height (5 to 6 ft). 15 out of these 30 play cricket in their leisure time. We want to create a model to predict who will play cricket during the leisure period, i.e., we need to segregate the students who play cricket in their leisure time based on the most significant input variable among the three. This is where the decision tree helps: it segregates the students on all values of the three variables and identifies the variable that creates the best homogeneous sets of students (which are heterogeneous to each other). In this example the variable Gender identifies the most homogeneous sets compared with the other two variables.
24. [L3, CO3, Unit 3 / Split algorithms] What is the difference between split algorithms based on:
    i. Information theory
    ii. Gini index
Ans:
Gini index: if we select two items from a population at random, they must be of the same class, and the probability of this is 1 if the population is pure.
- It works with a categorical target variable ("Success" or "Failure").
- It performs only binary splits.
- The higher the value of Gini, the higher the homogeneity.
- CART (Classification and Regression Trees) uses the Gini method to create binary splits.

Information gain: consider three nodes A, B and C, where all values in C are similar, B is more mixed and A is the most mixed. C can be described with the least information, B requires more information and A requires the maximum information. In other words, C is a pure node, B is less impure and A is more impure; a split based on information theory prefers the attribute that produces the purest (lowest-entropy) child nodes.
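
The contrast between the two measures can be made concrete with a short Python sketch (an illustration only; the class counts below are hypothetical and simply mirror the pure/less impure/most impure nodes C, B and A described above):

import math

def gini(counts):
    # Gini impurity of a node, given its class counts, e.g. [8, 2]
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    # Entropy (information needed to describe the node), in bits
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

# Three nodes of decreasing purity (cf. nodes C, B and A above)
for name, counts in [("pure (C)", [10, 0]), ("less impure (B)", [8, 2]), ("most impure (A)", [5, 5])]:
    print(f"{name:16s} gini = {gini(counts):.3f}   entropy = {entropy(counts):.3f}")

Both measures are zero for a pure node and maximal for a 50/50 split; a split algorithm picks the attribute that minimizes the weighted impurity of the child nodes (Gini for CART-style binary splits, entropy/information gain for ID3-style trees).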

25. [L2, CO3, Unit 3 / Classification introduction] Enlist the various classification and prediction issues.
Ans:
- Data cleaning
- Relevance analysis
- Data transformation and reduction

26. [L2, CO3, Unit 3 / Cluster analysis introduction] What do you mean by cluster analysis? List its various requirements.
Ans: Requirements:
- Scalability
- Ability to deal with different kinds of attributes
- Discovery of clusters with arbitrary shape
- High dimensionality
- Ability to deal with noisy data
- Interpretability

27. [L2, CO3, Unit 3 / Partitional methods] What are the various advantages and disadvantages of partitional methods?
28. [L2, CO3, Unit 3 / Hierarchical and density-based methods] Differentiate between hierarchical methods and density-based methods.
Ans:
Hierarchical methods: this method creates a hierarchical decomposition of the given set of data objects. Hierarchical methods can be classified on the basis of how the hierarchical decomposition is formed. There are two approaches:
- Agglomerative approach
- Divisive approach

Density-based methods: this method is based on the notion of density. The basic idea is to continue growing a given cluster as long as the density in the neighborhood exceeds some threshold, i.e., for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
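
As a hedged illustration of the two families (assuming scikit-learn and NumPy are available; the two-blob data below is synthetic), an agglomerative run and a density-based (DBSCAN) run can be compared on the same points:

import numpy as np
from sklearn.cluster import AgglomerativeClustering, DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal((0, 0), 0.3, (50, 2)), rng.normal((4, 4), 0.3, (50, 2))])

# Hierarchical (agglomerative): merges the closest clusters bottom-up; needs a number of clusters (or a dendrogram cut)
hier = AgglomerativeClustering(n_clusters=2).fit(X)

# Density-based: grows a cluster while the eps-neighbourhood contains at least min_samples points
dens = DBSCAN(eps=0.5, min_samples=5).fit(X)

print("agglomerative cluster sizes:", np.bincount(hier.labels_))
print("DBSCAN labels and counts (-1 = noise):", np.unique(dens.labels_, return_counts=True))

Agglomerative clustering needs the target number of clusters (or a cut level of the hierarchy), whereas DBSCAN needs the radius eps and min_samples and can label sparse points as noise.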

29. [L3, CO3, Unit 3 / Dealing with large databases] What are some efficient ways to perform k-means on large datasets?

30. [L3, CO3, Unit 3 / Cluster software] How is clustering useful in web data mining?
Ans: Clustering helps in classifying documents on the web for information discovery. As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of data and to observe the characteristics of each cluster.

Long Answer Type Questions


Format: Sr. No. [BT Level, CO, Unit, Difficulty / Topic] Question
1. [L1, CO1, Unit 1, Average / Introduction to data warehouse & ETL] Relate business intelligence with the data warehouse. Explain the role of ETL in a data warehouse with a suitable diagram.

2. [L2, CO1, Unit 1, Average / Operational data stores and data warehouse] How is a data warehouse different from a database? Explain the architecture of a data warehouse in detail.

3. [L3, CO1, Unit 1, Average / Design guidelines for data warehouse implementation] Present a list of guidelines that may assist in the success of a data warehouse project implementation. Discuss the importance of each of these guidelines.

4. [L2, CO1, Unit 1, Average / OLAP, data mart, metadata] Differentiate between:
    a) OLTP and OLAP
    b) Data mart and ODS
    c) ETL and metadata
    d) Subject-oriented and time-variant

5. [L2, CO1, Unit 1, Average / Design guidelines for data warehouse implementation] Enlist and explain the various key issues to be considered while planning for a data warehouse. Also mention its advantages w.r.t. various businesses.
6. [L2, CO2, Unit 1, Average / Star schema] List the various advantages and disadvantages of the star schema.

7. [L1, CO3, Unit 1, Average / Star, fact constellation schema, data warehouse, data mart & virtual warehouse] Briefly compare the following concepts. You may use an example to explain your point(s).
    a. Snowflake schema, fact constellation, Starnet query model
    b. Enterprise warehouse, data mart, virtual warehouse
8. [L2, CO3, Unit 1, Average / Introduction to data warehouse & its relation to business intelligence] Explain the concept of business intelligence. How is a data warehouse useful in business intelligence?

9. [L2, CO3, Unit 1, Average / Data warehouse & its functions] List the types of data warehouses. What are the various functions of data warehouse tools and utilities?

10. [L3, CO2, Unit 1, Average / Metadata & data warehouse] Enlist the types of metadata contained in the metadata repository. Explain these types in a data warehouse.

11. [L3, CO2, Unit 1, Difficult / Data mart & data warehouse] Relate the concept of data marts with the data warehouse. Explain the types of data marts with a diagram.

12. [L3, CO2, Unit 1, Difficult / Data warehouse] How does the process flow in a data warehouse? Explain the concept with a diagram.

13. [L3, CO2, Unit 1, Difficult / OLAP operations] A company is using OLAP to provide monthly summary information about its product and branch sales to the company's managers. How many different aggregates would be required to fill a data cube on product, branch and date if there are 20 products, 10 branches and five years of monthly data?
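
A rough worked calculation for question 13 (a sketch only; whether the finest-grain cells themselves count as "aggregates" depends on the interpretation intended) treats each dimension as having one extra "ALL" member:

products, branches, months = 20, 10, 5 * 12   # five years of monthly data

base_cells = products * branches * months                     # finest-grain cells: 12000
all_cells = (products + 1) * (branches + 1) * (months + 1)    # with an ALL member per dimension: 21 * 11 * 61 = 14091

print("base cells:", base_cells)
print("cells including ALL members:", all_cells)
print("aggregate (non-base) cells:", all_cells - base_cells)  # 2091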

14. [L2, CO2, Unit 1, Difficult / Data cube & data cube operations] Define the terminology used in a data cube. Use an example to illustrate the use of a data cube.

15. [L2, CO2, Unit 1, Difficult / Star & snowflake schema] Describe how a data warehouse is modeled and implemented using the star schema and the snowflake schema. Explain using an example.

16. [L3, CO2, Unit 1, Difficult / Data warehouse] Why is partitioning required in data warehousing? Explain with a real-life example.

17. [L1, CO1, Unit 1, Difficult / Data warehouse] Draw the complete architecture of a data warehouse. What are the various stages involved in it?

18. [L2, CO1, Unit 1, Difficult / Operational data stores & data warehouse] How do companies/organizations differentiate between a data warehouse and operational databases?

19. [L1, CO1, Unit 1, Difficult / ETL] Why must ETL deal with dirty data when extracting information from the source systems?

20. [L1, CO1, Unit 1, Difficult / Data cube] Which operation is used to re-orient the view of a data cube? Give an example.

21. [L1, CO1, Unit 1, Difficult / Data warehouse & ETL] Illustrate the role of ETL in a data warehouse with a suitable diagram.

22. [L2, CO1, Unit 1, Difficult / Fact constellation & snowflake schema] How do the fact constellation schema and the snowflake schema differ? Explain with the help of suitable examples.

23. [L1, CO3, Unit 1, Difficult / OLAP] How can you define ROLAP and MOLAP? Describe these two approaches and list their advantages and disadvantages.
24. [L2, CO3, Unit 1, Difficult / OLAP & OLTP] Write about operational database systems and data warehouses (OLTP and OLAP) in detail with an example.

25. [L1, CO3, Unit 1, Difficult / Data flows in data warehouse] What are the data flows and the various managers in a data warehouse? Explain with a suitable diagram.

26. [L1, CO1, Unit 1, Difficult / ETL architecture] Enlist the various steps in Extraction-Transformation-Loading with a neat diagram.

27. [L2, CO1, Unit 1, Difficult / OLAP models] Give the names of the storage models of OLAP. Explain the concepts with the help of an example.

28. [L1, CO1, Unit 1, Difficult / Multidimensional view & data cube] Describe the multidimensional view with the help of an example and frame the various data cube operations.

29. [L2, CO1, Unit 1, Difficult / Starnet query model for querying multidimensional databases] Enlist the types of queries managers need to pose to the enterprise's database systems.

30. [L2, CO1, Unit 1, Difficult / Data cube] Are all data cube entries non-zero? If not, why not? Explain with the help of examples.

31. [L1, CO1, Unit 2, Average / Apriori algorithm] Assuming a minimum support of 60% and a minimum confidence of 80%, find all frequent itemsets and list all association rules using the Apriori algorithm.
    Transaction   List of items
    T100          K, A, D, B
    T200          D, A, C, E, B
    T300          C, A, B, E
    T400          B, A, D
32. [L1, CO1, Unit 2, Average / Naïve algorithm] Why is the naïve algorithm not a good choice for a large number of items? Justify your answer with a suitable example.

33. [L2, CO1, Unit 2, Average / Data mining] What is the full form of KDD? How is it related to data mining? Explain the concept with a suitable example.

34. [L3, CO1, Unit 2, Average / Apriori algorithm]
    Transaction ID   Items
    100              Bread, Milk, Juice, Cheese
    200              Bread, Milk, Cheese
    300              Bread, Milk
    400              Bread, Juice, Cheese
    a) Use the Apriori algorithm to find the frequent itemsets with a minimum support of 90% and a minimum confidence of 90%.
    b) Repeat the algorithm for a minimum support of 40% and a minimum confidence of 90%.

35. [L2, CO1, Unit 2, Average / Data mining]
    a) How is data mining helpful in real life?
    b) Differentiate between descriptive and predictive data mining.

36. [L2, CO1, Unit 2, Average / Data mining] Discuss some of the reasons for growth in enterprise data.

37. [L2, CO1, Unit 2, Average / Association rules] How is prevalence different from confidence? What are confident rules?

38. [L1, CO3, Unit 2, Average / Data mining] List some applications of data mining.

39. [L2, CO2, Unit 2, Average] Give a few techniques to improve the efficiency of the Apriori algorithm.

40. [L2, CO2, Unit 2, Average / Naïve algorithm]
    a) Explain the naïve algorithm for finding association rules.
    b) List some of the weaknesses of the naïve algorithm.

41. [L3, CO2, Unit 2, Difficult / Direct hashing & pruning, Apriori] In what way does the hashing method work? Estimate how much work will be needed to compute association rules compared to Apriori. Make suitable assumptions.
42. [L2, CO3, Unit 2, Difficult / Classification] Describe the role of the classification technique. What are the various algorithms under the classification technique? Explain the process of classification.

43. [L3, CO2, Unit 2, Difficult / Mining frequent patterns without candidate generation]
    a) Explain the algorithm for FP-tree construction.
    b) How are frequent itemsets generated from an FP-tree? Explain with an example.

44. [L3, CO2, Unit 2, Difficult / FP-growth] Construct an FP-growth tree for the frequent itemsets.
    TID   Item sets
    1     f, a, c, d, g, m, p
    2     a, b, c, f, l, m, o
    3     b, f, h, o
    4     b, k, c, p
    5     a, f, c, l, p, m, n

45. [L1, CO1, Unit 2, Difficult / Schema] Suppose that a data warehouse consists of the three dimensions time, doctor and patient, and the two measures count and charge, where charge is the fee that a doctor charges a patient for a visit.
    (a) Enumerate three classes of schemas that are popularly used for modeling data warehouses.
    (b) Draw a schema diagram for the above data warehouse using one of the schema classes listed in (a).

46. [L1, CO2, Unit 2, Difficult / FP-growth] Elaborate the method of generating frequent itemsets without candidate generation.

47. [L2, CO3, Unit 2, Difficult / Data mining] List some social impacts of data mining.

48. [L1, CO1, Unit 2, Difficult / Apriori & naïve algorithm] How do the Apriori and naïve algorithms differ? Which is better? Illustrate the concept with an example.

49. [L1, CO2, Unit 2, Difficult / Snowflake schema] Elaborate the various steps required to design the snowflake schema with an example.

50. [L1, CO2, Unit 2, Difficult / Performance evaluation of algorithms] What are the various factors that affect the complexity of algorithms?

51. [L2, CO2, Unit 2, Difficult / Need for preprocessing] List the various tasks to be accomplished as part of data pre-processing.
52. [L1, CO2, Unit 2, Difficult / Data integration] List the issues to be considered during data integration.

53. [L1, CO3, Unit 2, Difficult / Data mining] Write the real-life applications of data mining.

54. [L2, CO2, Unit 2, Difficult / Software for association rule mining] How is association rule mining used in industry? What are the various software packages used for association rule mining?

55. [L3, CO2, Unit 2, Difficult / Apriori algorithm] Consider the data set D. Given a minimum support of 2, apply the Apriori algorithm on this dataset.
    Transaction ID   Items
    100              A, C, D
    200              B, C, E
    300              A, B, C, E
    400              B, E

56. [L1, CO1, Unit 2, Difficult] How is a data warehouse modeled and implemented using the star schema and the snowflake schema? Explain using an example.

57. [L1, CO3, Unit 2, Difficult / Data reduction] Enlist the various strategies for data reduction. Explain.

58. [L1, CO3, Unit 2, Difficult / Data cleaning] How do you clean the data? Explain.

59. [L1, CO3, Unit 2, Difficult / Software for association rule mining] How are association rules mined from large databases?

60. [L2, CO2, Unit 2, Difficult / Apriori algorithm] List the techniques to improve the efficiency of the Apriori algorithm.
61. [L1, CO2, Unit 3, Average / Dealing with large datasets] What are the various methods to deal with large databases? Explain with the help of real-life examples.
Ans:
- Partitioning method
- Hierarchical method
- Density-based method
- Grid-based method
- Model-based method
- Constraint-based method

Partitioning method
Suppose we are given a database of 'n' objects and the partitioning method constructs 'k' partitions of the data, where k <= n. Each partition represents a cluster, so the data is classified into k groups that satisfy the following requirements:
- Each group contains at least one object.
- Each object must belong to exactly one group.

Hierarchical methods
This method creates a hierarchical decomposition of the given set of data objects. Hierarchical methods can be classified on the basis of how the hierarchical decomposition is formed. There are two approaches:
- Agglomerative approach
- Divisive approach

Density-based method
This method is based on the notion of density. The basic idea is to continue growing a given cluster as long as the density in the neighborhood exceeds some threshold, i.e., for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points.

Grid-based method
Here the objects together form a grid: the object space is quantized into a finite number of cells that form a grid structure.
Advantages:
- The major advantage of this method is fast processing time.
- It is dependent only on the number of cells in each dimension of the quantized space.

Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of the data to the given model. This method locates the clusters by clustering the density function and reflects the spatial distribution of the data points. It also provides a way to automatically determine the number of clusters based on standard statistics, taking outliers or noise into account, and therefore yields robust clustering methods.

Constraint-based method
In this method, clustering is performed by the incorporation of user- or application-oriented constraints. A constraint refers to the user's expectations or to the properties of the desired clustering results. Constraints provide an interactive way of communicating with the clustering process; they can be specified by the user or by the application requirement.

Real-life examples:
- Order management system database
- Health care provider database
- Scientific database, etc.
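
The partitioning method above is most often implemented with k-means; for large databases a mini-batch variant keeps the per-iteration scan cost low. The sketch below is a minimal illustration assuming scikit-learn and NumPy are installed, with synthetic data standing in for a real order-management or health-care database:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Synthetic "large" dataset: 100 000 two-dimensional points around four centers
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(25_000, 2)) for c in ((0, 0), (5, 5), (0, 5), (5, 0))])

# Partitioning method: k = 4 clusters, mini-batches limit memory use and scan cost on large data
km = MiniBatchKMeans(n_clusters=4, batch_size=1024, random_state=0).fit(X)

print("cluster sizes:", np.bincount(km.labels_))
print("cluster centers:")
print(km.cluster_centers_)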

62. [L1, CO2, Unit 3, Average / Classification] What is classification? List some criteria for evaluating classification methods. Discuss them briefly.
Ans:
Classification is a form of data analysis that can be used to extract models describing important classes or to predict future data trends. Classification models predict categorical class labels.

The following are examples of cases where the data analysis task is classification:
- A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
- A marketing manager at a company needs to analyze a customer with a given profile to decide whether that customer will buy a new computer.

In both of the above examples, a model or classifier is constructed to predict the categorical labels. These labels are "risky" or "safe" for the loan application data and "yes" or "no" for the marketing data.

Classification methods:

Genetic algorithms
The idea of the genetic algorithm is derived from natural evolution. In a genetic algorithm, an initial population is created first. This initial population consists of randomly generated rules, and each rule can be represented by a string of bits.

Rough set approach
We can use the rough set approach to discover structural relationships within imprecise and noisy data.
Note: this approach can only be applied to discrete-valued attributes; therefore, continuous-valued attributes must be discretized before use.
Rough set theory is based on the establishment of equivalence classes within the given training data. The tuples that form an equivalence class are indiscernible, i.e., the samples are identical with respect to the attributes describing the data.

Fuzzy set approaches
Fuzzy set theory is also called possibility theory. It was proposed by Lotfi Zadeh in 1965 as an alternative to two-valued logic and probability theory. This theory allows us to work at a high level of abstraction and also provides the means for dealing with imprecise measurements of data.

63. [L2, CO2, Unit 3, Average / Predictive accuracy of classification] Elaborate on the methods of estimating the accuracy of a classification method.
Ans:
Accuracy of a classifier refers to its ability to predict the class label correctly; the accuracy of a predictor refers to how well a given predictor can guess the value of the predicted attribute for new data.

Methods:
- Hold-out method
- Random sub-sampling
- k-fold cross-validation
- Leave-one-out
- Bootstrap method
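
A minimal sketch of the hold-out and k-fold estimates (assuming scikit-learn is available; the dataset is synthetic and the choice of a decision tree classifier is arbitrary):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
clf = DecisionTreeClassifier(random_state=0)

# Hold-out: a single train/test split (here 70 % / 30 %)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
holdout_acc = clf.fit(X_tr, y_tr).score(X_te, y_te)

# k-fold cross-validation: k rounds, each fold used exactly once as the test set
kfold_acc = cross_val_score(clf, X, y, cv=5)

print("hold-out accuracy:", holdout_acc)
print("5-fold accuracies:", kfold_acc, "mean:", kfold_acc.mean())

Random sub-sampling repeats the hold-out split several times and averages; leave-one-out is k-fold with k equal to the number of tuples; the bootstrap samples the training set with replacement and tests on the tuples left out.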
64. [L3, CO2, Unit 3, Average / Decision tree, tree induction algorithm] What kind of data is the decision tree method most suitable for? Briefly outline the major steps of the algorithm to construct a decision tree and explain each step.
Ans:
A decision tree is a structure that includes a root node, branches and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. The topmost node in the tree is the root node.

For example, a decision tree for the concept buy_computer indicates whether a customer at a company is likely to buy a computer or not. Each internal node represents a test on an attribute and each leaf node represents a class.

The benefits of having a decision tree are as follows:
- It does not require any domain knowledge.
- It is easy to comprehend.
- The learning and classification steps of a decision tree are simple and fast.

Algorithm: Generate_decision_tree (generate a decision tree from the training tuples of data partition D)

Input:
- Data partition D, which is a set of training tuples and their associated class labels.
- attribute_list, the set of candidate attributes.
- Attribute_selection_method, a procedure to determine the splitting criterion that best partitions the data tuples into individual classes. This criterion includes a splitting_attribute and either a split point or a splitting subset.

Output: a decision tree.

Method:
create a node N;
if tuples in D are all of the same class C then
    return N as a leaf node labeled with class C;
if attribute_list is empty then
    return N as a leaf node labeled with the majority class in D;  // majority voting
apply Attribute_selection_method(D, attribute_list) to find the best splitting_criterion;
label node N with splitting_criterion;
if splitting_attribute is discrete-valued and multiway splits are allowed then  // not restricted to binary trees
    attribute_list = attribute_list - splitting_attribute;  // remove the splitting attribute
for each outcome j of splitting_criterion
    // partition the tuples and grow subtrees for each partition
    let Dj be the set of data tuples in D satisfying outcome j;  // a partition
    if Dj is empty then
        attach a leaf labeled with the majority class in D to node N;
    else
        attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
end for
return N;
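
In practice the same greedy, top-down induction is available off the shelf. The following sketch is an illustration only, using scikit-learn's CART-style learner with the entropy criterion on the bundled Iris data rather than the buy_computer example; note that scikit-learn always produces binary splits:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# entropy criterion ~ information gain (ID3/C4.5 style); criterion="gini" gives CART-style impurity
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X_tr, y_tr)

print("test accuracy:", tree.score(X_te, y_te))
print(export_text(tree, feature_names=["sepal_len", "sepal_wid", "petal_len", "petal_wid"]))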

65. [L3, CO2, Unit 3, Average]
66. [L3, CO3, Unit 3, Average / Partitional methods] Smooth the given data 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 on the basis of equi-depth partitioning, on the basis of bin boundaries and on the basis of bin means.
Ans:
* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
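
The same smoothing can be reproduced with a few lines of Python (a sketch, assuming the data is already sorted, as it is here):

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
n_bins = 3
depth = len(data) // n_bins
bins = [data[i:i + depth] for i in range(0, len(data), depth)]

# Smoothing by bin means: every value in a bin is replaced by the bin's (rounded) mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value is replaced by the closer of the bin's min/max
by_boundaries = [[min(b) if abs(v - min(b)) <= abs(v - max(b)) else max(b) for v in b] for b in bins]

print("Equi-depth bins:   ", bins)
print("By bin means:      ", by_means)
print("By bin boundaries: ", by_boundaries)

Running it reproduces exactly the bins and smoothed values listed above (the bin means 22.75 and 29.25 round to 23 and 29).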
67. [L2, CO2, Unit 3, Average / Frequent itemset mining] To find frequent itemsets, which algorithm is best suited, and how does it work?
Ans:
The Apriori algorithm is the classical algorithm for finding frequent itemsets (FP-growth is a popular alternative that avoids candidate generation altogether). Apriori works level-wise and relies on the Apriori property: every subset of a frequent itemset must itself be frequent.
- Scan the database once to count the support of every individual item and keep the items that meet the minimum support (the frequent 1-itemsets, L1).
- Join Lk with itself to generate candidate (k+1)-itemsets, and prune any candidate that has an infrequent k-subset.
- Scan the database to count the support of the remaining candidates and keep those that meet the minimum support, giving Lk+1.
- Repeat until no new frequent itemsets are generated. Association rules are then derived from the frequent itemsets using the minimum confidence threshold.
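
A compact, self-contained Python sketch of the level-wise procedure just described (standard library only; the example transactions are the Unit 2 dataset from question 55, with an absolute minimum support of 2):

from itertools import combinations

def apriori(transactions, min_support):
    # Return all frequent itemsets (as frozensets) with absolute support >= min_support
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}   # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # count support of each candidate by scanning the transactions
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # join step: combine frequent k-itemsets into (k+1)-item candidates,
        # prune step: keep only candidates whose every k-subset is frequent
        prev = list(level)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k + 1 and all(frozenset(s) in level for s in combinations(union, k)):
                    candidates.add(union)
        current = candidates
        k += 1
    return frequent

# Example: the dataset of question 55, minimum support 2 (absolute count)
db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset, support in sorted(apriori(db, 2).items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(sorted(itemset), support)

Rule generation would then enumerate, for each frequent itemset, the splits into antecedent and consequent whose confidence (support of the whole itemset divided by support of the antecedent) meets the minimum confidence threshold.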

68. [L2, CO3, Unit 3, Average / Split algorithm based on Gini index] How are attributes selected for a split based on the Gini index?
Ans:
For every candidate attribute (and, for numeric attributes, every candidate split point), the node is split and the weighted average of the Gini impurity of the resulting partitions is computed. The attribute and split point that give the lowest weighted Gini impurity, i.e. the largest reduction in impurity with respect to the parent node, are selected as the splitting criterion. CART uses this measure and always produces binary splits.
69. [L3, CO2, Unit 3, Average / Classification] How do bootstrapping, bagging and boosting improve the accuracy of classification?
Ans:
To understand the bootstrap, suppose it were possible to draw repeated samples (of the same size) from the population of interest a large number of times. One would then get a fairly good idea about the sampling distribution of a particular statistic from the collection of its values arising from these repeated samples. But that is not practical: it would be too expensive and would defeat the purpose of a sample study, which is to gather information cheaply and in a timely fashion. The idea behind the bootstrap is to use the data of the sample study at hand as a "surrogate population" for the purpose of approximating the sampling distribution of a statistic, i.e., to resample (with replacement) from the sample data at hand and create a large number of "phantom samples" known as bootstrap samples. The sample summary is then computed on each of the bootstrap samples (usually a few thousand). A histogram of the set of these computed values is referred to as the bootstrap distribution of the statistic.
In other words, we randomly sample with replacement from the n known observations and call the result a bootstrap sample. Since we allow replacement, a bootstrap sample is most likely not identical to the initial sample: some data points may be duplicated, and other data points from the initial sample may be omitted.

An example: if we begin with the sample 2, 4, 5, 6, 6, then all of the following are possible bootstrap samples:
- 2, 5, 5, 6, 6
- 4, 5, 6, 6, 6
- 2, 2, 4, 5, 5
- 2, 2, 2, 4, 6
- 2, 2, 2, 2, 2
- 4, 6, 6, 6, 6

Boosting and bagging:
Bootstrap aggregating (bagging) is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method.
Both boosting and bagging are ensemble techniques: instead of learning a single classifier, several are trained and their predictions are combined. While bagging uses an ensemble of independently trained classifiers, boosting is an iterative process that attempts to mitigate the prediction errors of earlier models by correcting them with later models.
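
The resampling idea can be demonstrated with NumPy on the tiny sample above (a sketch only; 10 000 bootstrap replicates of the mean are used purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
sample = np.array([2, 4, 5, 6, 6])          # the small sample from the example above

# Bootstrap: resample with replacement many times and inspect the statistic's distribution
boot_means = np.array([rng.choice(sample, size=sample.size, replace=True).mean()
                       for _ in range(10_000)])

print("observed mean:", sample.mean())
print("bootstrap estimate of the mean:", boot_means.mean())
print("approx. 95% interval for the mean:", np.percentile(boot_means, [2.5, 97.5]))

Bagging applies the same resampling to the training set: each bootstrap sample trains one classifier (for example with scikit-learn's BaggingClassifier over decision trees) and the ensemble votes or averages over the individual predictions, which reduces variance.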

70. [L1, CO2, Unit 3, Average] What are the various types of classification? Explain with the help of an example.

71. [L3, CO2, Unit 3, Difficult / Cluster analysis] What will be the class value when X = (Rain, Cool, High, Strong)?
72. [L2, CO2, Unit 3, Difficult / Decision tree] How is the root node evaluated in a decision tree? Explain the decision tree induction algorithm.
Ans:
A machine learning researcher named J. Ross Quinlan developed a decision tree algorithm known as ID3 (Iterative Dichotomiser) in 1980; later he presented C4.5, the successor of ID3. ID3 and C4.5 adopt a greedy approach: there is no backtracking, and the trees are constructed in a top-down, recursive, divide-and-conquer manner. The root node is evaluated with an attribute selection measure (such as information gain or the Gini index): the attribute that best separates the training tuples into the individual classes is placed at the root, and the same measure is applied recursively at every subsequent node.
For the full Generate_decision_tree algorithm (input, output and method), get the answer from Q64.

73. [L1, CO2, Unit 3, Difficult / Decision tree] Elaborate the following terms with suitable examples:
    a. Overfitting
    b. Pruning in decision trees
Ans:
Overfitting in decision trees: if a decision tree is fully grown, it may lose some generalization capability. This phenomenon is known as overfitting.

Tree pruning: tree pruning is performed in order to remove anomalies in the training data due to noise or outliers. The pruned trees are smaller and less complex. There are two approaches to prune a tree:
- Pre-pruning: the tree is pruned by halting its construction early.
- Post-pruning: this approach removes a sub-tree from a fully grown tree.
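
A hedged scikit-learn sketch of the two pruning styles (the dataset and the specific parameter values are illustrative only): pre-pruning via growth constraints such as max_depth and min_samples_leaf, and post-pruning via cost-complexity pruning (ccp_alpha).

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)                                   # fully grown tree
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0).fit(X_tr, y_tr)   # pre-pruning
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_tr, y_tr)                   # post-pruning (cost-complexity)

for name, m in [("fully grown", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(f"{name:12s} leaves={m.get_n_leaves():3d}  train={m.score(X_tr, y_tr):.3f}  test={m.score(X_te, y_te):.3f}")

A pruned tree usually trades a little training accuracy for fewer leaves and better (or comparable) test accuracy, which is the point of controlling overfitting.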

74. [L1, CO2, Unit 3, Difficult / Clustering methods] How do partitional and hierarchical clustering methods differ from one another?
Ans:
Hierarchical and partitional clustering have key differences in running time, assumptions, input parameters and the resulting clusters.
- Typically, partitional clustering is faster than hierarchical clustering.
- Hierarchical clustering requires only a similarity measure, while partitional clustering requires stronger assumptions, such as the number of clusters and the initial centers.
- Hierarchical clustering does not require the number of clusters as an input parameter, while partitional clustering algorithms require the number of clusters to start running.
- Hierarchical clustering returns a much more meaningful and subjective division into clusters, whereas partitional clustering results in exactly k clusters.
- Hierarchical clustering algorithms are more suitable for categorical data, as long as a similarity measure can be defined accordingly.
75. [L1, CO2, Unit 3, Difficult / Decision tree] Give the name of the method used to calculate gain in the case of a decision tree. What is entropy? How is it calculated in the case of a decision tree?
Ans:
Information gain is the measure used to calculate the gain of a split. A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). The ID3 algorithm uses entropy to calculate the homogeneity of a sample: if the sample is completely homogeneous the entropy is zero, and if the sample is equally divided between the classes the entropy is one.
To build a decision tree, we need to calculate two types of entropy using frequency tables:
    a) entropy using the frequency table of one attribute (the target attribute alone), and
    b) entropy using the frequency table of two attributes (the target attribute against each candidate attribute).
The information gain of an attribute is the difference between (a) and the weighted entropy in (b).
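
The two frequency-table entropies and the resulting gain can be computed with a short Python sketch (the Outlook/Play data below is the standard textbook-style toy example, used here only for illustration):

import math
from collections import Counter

def entropy(labels):
    # Entropy of the target from the frequency table of one attribute
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    # Gain = entropy(target) - weighted entropy of the target within each attribute value
    n = len(labels)
    gain = entropy(labels)
    for value in set(attribute_values):
        subset = [lab for att, lab in zip(attribute_values, labels) if att == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

# 14 records, 9 "yes" and 5 "no" (the classic play example)
outlook = ["sunny"] * 5 + ["overcast"] * 4 + ["rain"] * 5
play = ["no", "no", "no", "yes", "yes"] + ["yes"] * 4 + ["yes", "yes", "yes", "no", "no"]
print("Entropy(Play) =", round(entropy(play), 3))                         # about 0.940
print("Gain(Play, Outlook) =", round(information_gain(outlook, play), 3)) # about 0.247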

76. [L1, CO2, Unit 3, Difficult / Partitional, hierarchical & density-based methods] Which is better among partitional, hierarchical and density-based methods, and why?
Ans:
Partitioning method: suppose we are given a database of 'n' objects and the partitioning method constructs 'k' partitions of the data, where k <= n. Each partition represents a cluster, so the data is classified into k groups that satisfy the following requirements:
- Each group contains at least one object.
- Each object must belong to exactly one group.

Hierarchical methods: this method creates a hierarchical decomposition of the given set of data objects. Hierarchical methods can be classified on the basis of how the hierarchical decomposition is formed. There are two approaches:
- Agglomerative approach
- Divisive approach

Density-based method: this method is based on the notion of density. The basic idea is to continue growing a given cluster as long as the density in the neighborhood exceeds some threshold, i.e., for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
77. [L1, CO1, Unit 3, Difficult / Partitional method] List some advantages and disadvantages of the partitional method.
Ans:
Suppose we are given a database of 'n' objects and the partitioning method constructs 'k' partitions of the data, where k <= n. Each partition represents a cluster, so the data is classified into k groups that satisfy the following requirements:
- Each group contains at least one object.
- Each object must belong to exactly one group.

78. [L2, CO1, Unit 3, Difficult / Density-based method] List some advantages and disadvantages of the density-based method.
Ans:
This method is based on the notion of density. The basic idea is to continue growing a given cluster as long as the density in the neighborhood exceeds some threshold, i.e., for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
79. [L2, CO1, Unit 3, Difficult / Hierarchical method] List some advantages and disadvantages of the hierarchical method.
Ans:
This method creates a hierarchical decomposition of the given set of data objects. Hierarchical methods can be classified on the basis of how the hierarchical decomposition is formed. There are two approaches:
- Agglomerative approach
- Divisive approach

Advantages:
1. It does not assume a particular value of k, as needed by k-means clustering.
2. The generated tree may correspond to a meaningful taxonomy.
3. Only a distance or "proximity" matrix is needed to compute the hierarchical clustering.
80. [L3, CO1, Unit 3, Difficult / Split algorithm based on Gini index] How are the Gini index and information theory different from each other? Explain with the help of suitable examples.

81. [L1, CO1, Unit 3, Difficult / Dealing with large databases] Enlist the various ways to deal with large databases. Illustrate with a real-life example. Get the answer from Q61.

82. [L2, CO2, Unit 3, Difficult / Decision tree] What types of attributes does the decision tree approach work for? Explain the difference between a training set and a test set.

83. [L2, CO2, Unit 3, Difficult / Introduction to web data mining & search engines] Elaborate the concepts with real-life examples:
    a. Web data mining
    b. Search engines

84. [L3, CO3, Unit 3, Difficult / Estimating the predictive accuracy of classification methods] How can we estimate the predictive accuracy of a classification method?

85. [L2, CO2, Unit 3, Difficult / Clustering methods] Enlist some real-life examples of clustering. What are the various methods to do the analysis of clustering? Get the answer from Q61.
86. [L3, CO2, Unit 3, Difficult / Decision tree] How does the algorithm for constructing a decision tree from training samples work?

87. [L2, CO3, Unit 3, Difficult / Clustering] Elaborate the following clustering methods in detail:
    (i) BIRCH
    (ii) CURE
88. [L2, CO3, Unit 3, Difficult / Decision tree & clustering] List the various advantages and disadvantages of decision trees over other classification methods.
Ans:
Advantages
- Easy to understand: decision tree output is very easy to understand, even for people from a non-analytical background. It does not require any statistical knowledge to read and interpret, its graphical representation is very intuitive, and users can easily relate it to their hypotheses.
- Useful in data exploration: a decision tree is one of the fastest ways to identify the most significant variables and the relations between two or more variables. With the help of decision trees we can create new variables/features that have better power to predict the target variable. It can also be used in the data exploration stage; for example, when we are working on a problem where information is available in hundreds of variables, a decision tree helps to identify the most significant ones.
- Less data cleaning required: it requires less data cleaning compared with some other modeling techniques, and it is not influenced by outliers and missing values to a fair degree.
- Data type is not a constraint: it can handle both numerical and categorical variables.
- Non-parametric method: a decision tree is considered a non-parametric method, which means that decision trees make no assumptions about the space distribution or the classifier structure.

Disadvantages
- Overfitting: overfitting is one of the most practical difficulties for decision tree models. This problem is addressed by setting constraints on the model parameters and by pruning (see Q73).
- Not fit for continuous variables: while working with continuous numerical variables, a decision tree loses information when it categorizes the variables into different bins.
89. [L3, CO2, Unit 3, Difficult / Cluster software] Elaborate the term cluster software with the help of examples.

90. [L2, CO3, Unit 3, Difficult / Cluster analysis] Give the key points related to the requirements of cluster analysis. Explain in detail.
