Question bank columns: Sr. No. | Level in BT | CO | Unit / Topic (question location) | Question
1. [L1, CO1] Unit-1 / Introduction: Why do many enterprises need a data warehouse? Give two major components of any data warehouse system.
2. [L1, CO1] Unit-1 / ETL: Explain why ETL must deal with dirty data when extracting information from the source systems.
3. [L1, CO1] Unit-1 / ETL: What is ETL? List the major steps involved in the ETL process.
4. [L2, CO1] Unit-1 / Multidimensional view and data cube: Explain the multidimensional view and the various data cube operations.
5. [L2, CO1] Unit-1 / OLAP: Differentiate between OLAP and OLTP.
6. [L2, CO2] Unit-1 / Star and snowflake: Explain the difference between the star schema and the snowflake schema.
7. [L1, CO2] Unit-1 / Fact constellation: List the various advantages and disadvantages of the fact constellation schema.
8. [L2, CO2] Unit-1 / OLAP operations in the multidimensional data model: Enlist the various OLAP operations in the multidimensional data model.
9. [L3, CO2] Unit-1 / Star schema: Elaborate the concept of the Starnet query model for querying multidimensional databases.
10. [L3, CO1] Unit-1 / Data warehouse: List some of the design guidelines for data warehouse implementation.
11. [L3, CO2] Unit-2 / Apriori algorithm: What is the Apriori algorithm? Name some variants of the Apriori algorithm.
12. [L2, CO2] Unit-2 / Naïve algorithm: Explain Naïve Bayes classification.
13. [L3, CO2] Unit-2 / Direct hashing and pruning: What are the various factors affecting the complexity of direct hashing and pruning (DHP)?
14. [L2, CO2] Unit-2 / Dynamic itemset counting: Explain the various operations of dynamic itemset counting (DIC).
15. [L2, CO2] Unit-2 / Frequent patterns without candidate generation: Elaborate the concept of FP-growth with the help of an example.
16. [L2, CO2] Unit-2 / Mining: What are the various association rules in data mining? Explain with the help of an example.
17. [L1, CO1] Unit-2 / Need for data preprocessing: Why is there a need for data preprocessing?
18. [L1, CO3] Unit-2 / Data cleaning: Describe the different data cleaning approaches.
19. [L2, CO3] Unit-2 / Data integration & transformation, data reduction, data discretization & concept hierarchy generation: Explain the following terms:
i. Data integration and transformation
ii. Data reduction
iii. Data discretization and concept hierarchy generation
20. [L3, CO3] Unit-2 / Performance evaluation of algorithms: How is the performance of the various algorithms assessed?
21. [L1, CO3] Unit-3 / Introduction to web data mining: Enlist the classification techniques of data mining.
Ans: Classification is a two-step process:
Building the classifier or model.
Using the classifier for classification.
22. Unit-3 / Decision tree: The decision tree works for both categorical and continuous input and output variables. In this technique, we split the population or sample into two or more homogeneous sets (sub-populations) based on the most significant splitter/differentiator among the input variables.
26. [L2, CO3] Unit-3 / Cluster analysis introduction: What do you mean by cluster analysis? List its various requirements.
Requirements:
Scalability
Ability to deal with different kinds of attributes
High dimensionality
27. [L2, CO3] Unit-3 / Partitional methods: What are the various advantages and disadvantages of partitional methods?
28. [L2, CO3] Differentiate between hierarchical methods and density-based methods.
29. [L3, CO3] Unit-3 / Dealing with large databases: What are some efficient ways to perform k-means on large datasets?
30. [L3, CO3] Unit-3 / Cluster software: How is clustering useful in web data mining?
Ans:
Clustering helps in classifying documents on the web for information discovery.
As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of data and to observe the characteristics of each cluster.
14. [L2, CO2] Unit-1 Difficult / Data cube & data cube operations: Define the terminology used in a data cube. Use an example to illustrate the use of a data cube.
15. [L2, CO2] Unit-1 Difficult / Star & snowflake schema: Describe how a data warehouse is modeled and implemented using the star schema and the snowflake schema. Explain using an example.
16. [L3, CO2] Unit-1 Difficult / Data warehouse: Why is partitioning required in data warehousing? Explain with a real-life example.
17. [L1, CO1] Unit-1 Difficult / Data warehouse: Draw the complete architecture of a data warehouse. What are the various stages involved in it?
18. [L2, CO1] Unit-1 Difficult / Operational data stores & data warehouse: How does a company or organization differentiate between a data warehouse and operational databases?
19. [L1, CO1] Unit-1 Difficult / ETL: Why must ETL deal with dirty data when extracting information from the source systems?
20. [L1, CO1] Unit-1 Difficult / Data cube: Which operation is used to re-orient the view of a data cube? Give an example.
21. [L1, CO1] Unit-1 Difficult / Data warehouse & ETL: Illustrate the role of ETL in a data warehouse with a suitable diagram.
22. [L2, CO1] Unit-1 Difficult / Fact constellation & snowflake schema: How do the fact constellation schema and the snowflake schema differ? Explain with the help of suitable examples.
23. [L1, CO3] Unit-1 Difficult / OLAP: How can you define ROLAP and MOLAP? Describe these two approaches and list their advantages and disadvantages.
24. [L2, CO3] Unit-1 Difficult / OLAP & OLTP: Write about operational database systems and data warehouses (OLTP and OLAP) in detail, with an example.
25. [L1, CO3] Unit-1 Difficult / Data flows in a data warehouse: What are the data flows and the various managers in a data warehouse? Explain with a suitable diagram.
26. [L1, CO1] Unit-1 Difficult / ETL architecture: Enlist the various steps in extraction-transformation-loading (ETL), with a neat diagram.
27. [L2, CO1] Unit-1 Difficult / OLAP models: Give the names of the storage models of OLAP, and explain the concepts with the help of an example.
28. [L1, CO1] Unit-1 Difficult / Multidimensional view & data cube: Describe the multidimensional view with the help of an example, and frame the various data cube operations.
29. [L2, CO1] Unit-1 Difficult / Starnet query model for querying multidimensional databases: Enlist the types of queries managers need to pose to the enterprise's database systems.
30. [L2, CO1] Unit-1 Difficult / Data cube: Are all data cube entries non-zero? If not, why not? Explain with the help of examples.
31. [L1, CO1] Unit-2 Average / Apriori algorithm: Assuming a minimum support of 60% and a minimum confidence of 80%, find all frequent itemsets and list all association rules using the Apriori algorithm.
Transaction | List of items
T100 | K, A, D, B
T200 | D, A, C, E, B
T300 | C, A, B, E
T400 | B, A, D
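For reference, a minimal Python sketch of the frequent-itemset phase of Apriori on these four transactions (60% of 4 transactions means a support count of at least 3; the function name is illustrative and rule generation is omitted):

```python
def apriori_frequent(transactions, min_support):
    """Return all frequent itemsets (as frozensets) with support count >= min_support."""
    items = {i for t in transactions for i in t}
    # Frequent 1-itemsets: items appearing in at least min_support transactions
    current = [frozenset([i]) for i in items
               if sum(i in t for t in transactions) >= min_support]
    frequent = list(current)
    k = 2
    while current:
        # Join step: build size-k candidates from frequent (k-1)-itemsets
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Count step: keep candidates contained in at least min_support transactions
        current = [c for c in candidates
                   if sum(c <= t for t in transactions) >= min_support]
        frequent.extend(current)
        k += 1
    return frequent

transactions = [{"K", "A", "D", "B"}, {"D", "A", "C", "E", "B"},
                {"C", "A", "B", "E"}, {"B", "A", "D"}]
freq = apriori_frequent(transactions, min_support=3)  # 60% of 4 -> count >= 3
```

On this data the frequent itemsets are {A}, {B}, {D}, {A,B}, {A,D}, {B,D} and {A,B,D}.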
32. [L1, CO1] Unit-2 Average / Naïve algorithm: Why is the naïve algorithm not a good choice for a large number of items? Justify your answer with a suitable example.
33. [L2, CO1] Unit-2 Average / Data mining: What is the full form of KDD? How is it related to data mining? Explain the concept with a suitable example.
34. [L3, CO1] Unit-2 Average / Apriori algorithm:
Transaction ID | Items
100 | Bread, Milk, Juice, Cheese
200 | Bread, Milk, Cheese
300 | Bread, Milk
400 | Bread, Juice, Cheese
a) Use the Apriori algorithm to find the frequent itemsets with a minimum support of 90% and a minimum confidence of 90%.
45. [L1, CO1] Unit-2 Difficult / Schema: Suppose that a data warehouse consists of the three dimensions time, doctor, and patient, and the two measures count and charge, where charge is the fee that a doctor charges a patient for a visit.
(a) Enumerate three classes of schemas that are popularly used for modeling data warehouses.
(b) Draw a schema diagram for the above data warehouse using one of the schema classes listed in (a).
46. [L1, CO2] Unit-2 Difficult / FP-growth: Elaborate the method of generating frequent itemsets without candidate generation.
47. [L2, CO3] Unit-2 Difficult / Data mining: List some social impacts of data mining.
48. [L1, CO1] Unit-2 Difficult / Apriori & naïve algorithm: How do the Apriori and naïve algorithms differ, and which is better? Illustrate the concept with an example.
49. [L1, CO2] Unit-2 Difficult / Snowflake schema: Elaborate the various steps required to design the snowflake schema, with an example.
50. [L1, CO2] Unit-2 Difficult / Performance evaluation of algorithms: What are the various factors affecting the complexity of algorithms?
51. [L2, CO2] Unit-2 Difficult / Need for preprocessing: List the various tasks to be accomplished as part of data preprocessing.
52. [L1, CO2] Unit-2 Difficult / Data integration: List the issues to be considered during data integration.
53. [L1, CO3] Unit-2 Difficult / Data mining: Write about the real-life applications of data mining.
54. [L2, CO2] Unit-2 Difficult / Software for association rule mining: How is association rule mining used in industry? What are the various software tools used for association rule mining?
55. [L3, CO2] Consider the dataset D. Given a minimum support of 2, apply the Apriori algorithm to this dataset.
56. [L1, CO1] Unit-2 Difficult: How is a data warehouse modeled and implemented using the star schema and the snowflake schema? Explain using an example.
57. [L1, CO3] Unit-2 Difficult / Data reduction: Enlist and explain the various strategies for data reduction.
58. [L1, CO3] Unit-2 Difficult / Data cleaning: How do you clean the data? Explain.
59. [L1, CO3] Unit-2 Difficult / Software for association rule mining: How are association rules mined from large databases?
60. [L2, CO2] Unit-2 Difficult / Apriori algorithm: List the techniques to improve the efficiency of the Apriori algorithm.
61. [L1, CO2] What are the various methods to deal with large databases? Explain with the help of real-life examples.
Ans: The major clustering methods are:
Partitioning method
Hierarchical method: the agglomerative approach and the divisive approach
Density-based method
Grid-based method: the objects together form a grid; the object space is quantized into a finite number of cells that form a grid structure.
Model-based method: a model is hypothesized for each cluster to find the best fit of the data to the given model. This method locates the clusters by clustering the density function, and it reflects the spatial distribution of the data points.
Constraint-based method
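As an illustration of the partitioning method listed above, here is a minimal k-means sketch in plain Python (the dataset, k, the fixed iteration count and the seed are made-up assumptions for the example):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: partition points (tuples of coordinates) into k clusters."""
    random.seed(seed)
    centroids = random.sample(points, k)  # initial centroids drawn from the data
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster
        for j, c in enumerate(clusters):
            if c:
                centroids[j] = tuple(sum(col) / len(c) for col in zip(*c))
    return clusters, centroids

# Two well-separated blobs of three points each
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
clusters, centroids = kmeans(points, k=2)
```

Note that, unlike hierarchical methods, this partitioning method must be given k up front.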
62. [L1, CO2] Unit-3 Average / Classification: What is classification? List some criteria for evaluating classification methods and discuss them briefly.
Ans:
Classification is a form of data analysis that can be used to extract models describing important classes or to predict future data trends. Classification models predict categorical class labels.
Classification methods include, among others, genetic algorithms.
63. Unit-3 Average / Predictive accuracy of classification:
Ans:
Accuracy: the accuracy of a classifier refers to its ability to predict the class label correctly, while the accuracy of a predictor refers to how well a given predictor can guess the value of the predicted attribute for new data.
Methods:
Hold-out method
Random sub-sampling
K-fold cross-validation
Leave-one-out method
Bootstrap method
64. [L3, CO2] What kind of data is the decision tree method most suitable for? Briefly outline the major steps of the algorithm to construct a decision tree, and explain each step.
Ans:
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. The topmost node in the tree is the root node. It is easy to comprehend.
Algorithm: Generate_decision_tree
Input:
Data partition D, which is a set of training tuples and their associated class labels.
attribute_list, the set of candidate attributes.
Attribute_selection_method, a procedure to determine the splitting criterion that best partitions the data tuples into individual classes. This criterion includes a splitting_attribute and either a split point or a splitting subset.
Output: a decision tree.
Method:
create a node N;
if the tuples in D are all of the same class C, then return N as a leaf node labeled with class C;
if attribute_list is empty, then return N as a leaf node labeled with the majority class in D;
apply Attribute_selection_method(D, attribute_list) to find the best splitting_criterion, and label node N with it;
for each outcome j of the splitting_criterion:
let Dj be the set of tuples in D satisfying outcome j;
if Dj is empty then
attach a leaf labeled with the majority class in D to node N;
else
attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
end for
return N;
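The Generate_decision_tree steps above can be sketched as runnable Python. This is an ID3-style sketch that uses entropy as the attribute selection measure; the tiny weather dataset and all names are illustrative, not part of the question bank:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a class-label list: 0 when homogeneous."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, attrs):
    """rows: list of dicts attribute -> value; labels: parallel class labels."""
    if len(set(labels)) == 1:              # all tuples in the same class -> leaf
        return labels[0]
    if not attrs:                          # attribute list empty -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    def weighted_entropy(a):               # attribute selection (information gain)
        total = 0.0
        for v in set(r[a] for r in rows):
            sub = [l for r, l in zip(rows, labels) if r[a] == v]
            total += len(sub) / len(labels) * entropy(sub)
        return total
    best = min(attrs, key=weighted_entropy)
    rest = [a for a in attrs if a != best]
    return {(best, v): build_tree([r for r in rows if r[best] == v],
                                  [l for r, l in zip(rows, labels) if r[best] == v],
                                  rest)
            for v in set(r[best] for r in rows)}

rows = [{"outlook": "sunny"}, {"outlook": "sunny"}, {"outlook": "rain"}]
labels = ["no", "no", "yes"]
tree = build_tree(rows, labels, ["outlook"])
```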
65. [L3, CO2] Unit-3 Average:
66. [L3, CO3] Unit-3 Average / Partitional methods: Smooth the given data 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 by equi-depth partitioning, by bin boundaries, and by bin means.
Partition into (equi-depth) bins:
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
Smoothing by bin means:
Bin 1: 9, 9, 9, 9
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 4, 15
Bin 2: 21, 21, 25, 25
Bin 3: 26, 26, 26, 34
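The worked smoothing above can be reproduced with a short Python sketch (function names are illustrative; a value equidistant from both boundaries is sent to the lower one, which matches the answer here):

```python
def equi_depth_bins(values, n_bins):
    """Split sorted values into bins of equal count (equi-depth partitioning)."""
    values = sorted(values)
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace each value by its (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace each value by the nearest bin boundary (min or max of its bin)."""
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(data, 3)
```

Running the two smoothers on `bins` yields exactly the bin-mean and bin-boundary results listed above.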
67. [L2, CO2] Unit-3 Average / Tree induction algorithm: Which algorithm is best suited to finding frequent itemsets, and how does it work?
Ans:
The Apriori algorithm is the classic method for finding frequent itemsets. It exploits the property that every subset of a frequent itemset must itself be frequent: it first scans the database to find the frequent 1-itemsets, then repeatedly joins the frequent k-itemsets to generate candidate (k+1)-itemsets, prunes any candidate containing an infrequent subset, and scans the database to count the surviving candidates, stopping when no new frequent itemsets are found. FP-growth finds the same itemsets without candidate generation.
68. [L2, CO3] Unit-3 Average / Split algorithm based on the Gini index: How are attributes selected for a split based on the Gini index?
Ans:
The Gini index measures the impurity of a data partition D as Gini(D) = 1 - sum over classes of pi^2, where pi is the probability that a tuple in D belongs to class i. For each candidate attribute (and, for continuous attributes, each candidate split point), the weighted Gini index of the resulting partitions is computed, and the attribute whose split yields the minimum Gini index, i.e. the maximum reduction in impurity, is selected as the splitting attribute.
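A minimal sketch of the Gini computation for a binary split (assuming the standard Gini(D) = 1 - sum of squared class probabilities; function names are illustrative):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a partition: 1 - sum(p_i^2) over the class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(labels_left, labels_right):
    """Weighted Gini index of a binary split; the split minimizing this is chosen."""
    n = len(labels_left) + len(labels_right)
    return (len(labels_left) / n * gini(labels_left)
            + len(labels_right) / n * gini(labels_right))
```

A pure partition has Gini 0, so a split that separates the classes perfectly has a weighted Gini index of 0 and would always be selected.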
69. [L3, CO2] Unit-3 Average / Classification: How do bootstrapping, bagging and boosting improve the accuracy of classification?
Ans:
To understand the bootstrap, suppose it were possible to draw repeated samples (of the same size) from the population of interest a large number of times. Then one would get a fairly good idea about the sampling distribution of a particular statistic from the collection of its values arising from these repeated samples. But that does not make sense, as it would be too expensive and would defeat the purpose of a sample study: the purpose of a sample study is to gather information cheaply and in a timely fashion. The idea behind the bootstrap is to use the data of the sample study at hand as a "surrogate population" for the purpose of approximating the sampling distribution of a statistic, i.e. to resample (with replacement) from the sample data at hand and create a large number of "phantom samples", known as bootstrap samples. The sample summary is then computed on each of the bootstrap samples (usually a few thousand). A histogram of the set of these computed values is referred to as the bootstrap distribution of the statistic.
In other words, we randomly sample with replacement from the n known observations and call the result a bootstrap sample. Since we allow replacement, a bootstrap sample is most likely not identical to the initial sample: some data points may be duplicated, and other data points from the initial sample may be omitted.
An example: starting from the sample 2, 4, 5, 6, 6, all of the following are possible bootstrap samples:
2, 5, 5, 6, 6
4, 5, 6, 6, 6
2, 2, 4, 5, 5
2, 2, 2, 4, 6
2, 2, 2, 2, 2
4, 6, 6, 6, 6
Boosting and bagging:
Bootstrap aggregating (bagging) is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method.
Both boosting and bagging are ensemble techniques: instead of learning a single classifier, several are trained and their predictions combined. While bagging uses an ensemble of independently trained classifiers, boosting is an iterative process that attempts to mitigate the prediction errors of earlier models by predicting them with later models.
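The resampling idea above can be sketched in a few lines of Python (the seed and the number of bootstrap samples are arbitrary choices for the example):

```python
import random

def bootstrap_samples(data, n_samples, seed=0):
    """Draw n_samples bootstrap samples: with replacement, same size as data."""
    rng = random.Random(seed)
    return [[rng.choice(data) for _ in data] for _ in range(n_samples)]

def bootstrap_means(data, n_samples=1000, seed=0):
    """Bootstrap distribution of the sample mean (one mean per phantom sample)."""
    return [sum(s) / len(s) for s in bootstrap_samples(data, n_samples, seed)]

data = [2, 4, 5, 6, 6]          # the sample from the example above
means = bootstrap_means(data)   # histogram of these = bootstrap distribution
```

Bagging would train one classifier per bootstrap sample and combine their votes, rather than computing a summary statistic.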
70. [L1, CO2] Unit-3 Average: What are the various types of classification? Explain with the help of an example.
71. [L3, CO2] Unit-3 Difficult / Cluster analysis: What will be the class value when X = (Rain, Cool, High, Strong)?
[L2, CO2] Unit-3 / Decision tree: How is the root node evaluated in a decision tree? Explain the decision tree induction algorithm.
Ans:
The root node is chosen by the attribute selection measure (for example, information gain or the Gini index): the attribute that best partitions the training tuples into individual classes becomes the root. The induction algorithm is the Generate_decision_tree procedure given earlier in this unit.
74. [L1, CO2] Unit-3 Difficult / Clustering methods: How do partitional and hierarchical clustering methods differ from one another?
Ans:
Hierarchical and partitional clustering have key differences in running time, assumptions, input parameters and resultant clusters:
Typically, partitional clustering is faster than hierarchical clustering.
Hierarchical clustering requires only a similarity measure, while partitional clustering requires stronger assumptions, such as the number of clusters and the initial centers.
Hierarchical clustering does not require any input parameters, while partitional clustering algorithms need the number of clusters before they can start running.
Hierarchical clustering returns a much more meaningful and subjective division of clusters, while partitional clustering results in exactly k clusters.
Hierarchical clustering algorithms are more suitable for categorical data, as long as a similarity measure can be defined accordingly.
75. [L1, CO2] Unit-3 Difficult / Decision tree: Give the name of the method used to calculate gain in a decision tree. What is entropy, and how is it calculated for a decision tree?
Ans:
Entropy: a decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). The ID3 algorithm uses entropy to calculate the homogeneity of a sample: if the sample is completely homogeneous the entropy is zero, and if the sample is equally divided the entropy is one.
To build a decision tree, we need to calculate two types of entropy using frequency tables:
a) entropy using the frequency table of one attribute;
b) entropy using the frequency table of two attributes.
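The two entropy calculations above can be sketched in Python (illustrative names: entropy is the one-attribute frequency-table entropy; conditional_entropy is the two-attribute version, and their difference is the information gain):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of the class distribution: 0 when homogeneous, 1 for a 50/50 split."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(attr_values, labels):
    """Entropy using the frequency table of two attributes: weighted entropy
    of the class labels within each value of the splitting attribute."""
    n = len(labels)
    by_value = {}
    for v, l in zip(attr_values, labels):
        by_value.setdefault(v, []).append(l)
    return sum(len(ls) / n * entropy(ls) for ls in by_value.values())

# Information gain of an attribute:
# gain = entropy(labels) - conditional_entropy(attr_values, labels)
```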
[L1, CO2] Unit-3 / Clustering methods: Which is better among partitional, hierarchical and density-based methods, and why?
Ans:
Candidates: the partitioning method, the hierarchical method (divisive approach) and the density-based method.
Advantages of hierarchical clustering:
1. It does not assume a particular value of k, as k-means clustering does.
2. The generated tree may correspond to a meaningful taxonomy.
3. Only a distance or "proximity" matrix is needed to compute the hierarchical clustering.
80. [L3, CO1] Unit-3 Difficult / Split algorithm based on the Gini index: How do the Gini index and information theory differ from each other? Explain with the help of suitable examples.
81. [L1, CO1] Unit-3 Difficult / Dealing with large databases: Enlist the various ways to deal with large databases, and illustrate with a real-life example. See the answer to Q61.
82. [L2, CO2] Unit-3 Difficult / Decision tree: What types of attributes does the decision tree approach work for? Explain the difference between a training set and a test set.
83. [L2, CO2] Unit-3 Difficult / Introduction to web data mining & search engines: Elaborate the concepts with real-life examples:
a. Web data mining
b. Search engines
84. [L3, CO3] Unit-3 Difficult / Estimating the predictive accuracy of a classification method: How can we estimate the predictive accuracy of a classification method?
85. [L2, CO2] Unit-3 Difficult / Clustering methods: Enlist some real-life examples of clustering. What are the various methods for cluster analysis? See the answer to Q61.
86. [L3, CO2] Unit-3 Difficult / Decision tree: How does the algorithm for constructing a decision tree from training samples work?
87. [L2, CO3] Unit-3 Difficult / Clustering: Elaborate the following clustering methods in detail:
(i) BIRCH
(ii) CURE
88. [L2, CO3] Unit-3 Difficult / Decision tree & clustering: List the various advantages and disadvantages of decision trees over other classification methods.
Advantages:
Easy to understand: decision tree output is very easy to understand, even for people from a non-analytical background. It does not require any statistical knowledge to read and interpret. Its graphical representation is very intuitive, and users can easily relate it to their hypotheses.
Useful in data exploration: a decision tree is one of the fastest ways to identify the most significant variables and the relations between two or more variables. With the help of decision trees, we can create new variables or features that have better power to predict the target variable. It can also be used in the data exploration stage: for example, when we are working on a problem where information is available in hundreds of variables, a decision tree will help to identify the most significant ones.
Less data cleaning required: it requires less data cleaning than some other modeling techniques, and it is not influenced by outliers and missing values to a fair degree.
Data type is not a constraint: it can handle both numerical and categorical variables.
Non-parametric method: a decision tree is considered a non-parametric method, meaning that decision trees make no assumptions about the space distribution or the classifier structure.
Disadvantages:
Overfitting: overfitting is one of the most practical difficulties for decision tree models. The problem is addressed by setting constraints on the model parameters and by pruning.
Not fit for continuous variables: while working with continuous numerical variables, a decision tree loses information when it categorizes variables into different categories.
89. [L3, CO2] Unit-3 Difficult / Cluster software: Elaborate the term cluster software with the help of examples.
90. [L2, CO3] Unit-3 Difficult / Cluster analysis: Give the key points related to the requirements of cluster analysis, and explain them in detail.