1. "Data mining discovers hidden knowledge." Do you agree? Justify your answer. What
are the differences between a database and data mining?
Yes, I agree that data mining discovers hidden knowledge.
It is generally accepted that the reason for capturing and storing large amounts of data is due
to the belief that there is valuable information implicitly coded within it. An important issue
is therefore how is this hidden information (if it exists at all) to be revealed? Traditional
methods of knowledge generation rely largely upon manual analysis and interpretation.
However, as data collections continue to grow in size and complexity, there is a
corresponding growing need for more sophisticated techniques of analysis. One such
innovative approach to the knowledge discovery process is known as data mining.
Data mining is essentially the computer-assisted process of information analysis. This can be
performed using either a top-down or a bottom-up approach. Bottom-up data mining analyses
raw data in an attempt to discover hidden trends and groups, whereas the aim of top-down
data mining is to test a specific hypothesis. Data mining may be performed using a variety of
techniques, including intelligent agents, powerful database queries, and multi-dimensional
analysis tools. Multi-dimensional analysis tools include the use of neural networks, as
described in this work.
The data mining approach expedites the initial stages of information analysis, thereby quickly
providing initial feedback that may be further and more thoroughly investigated if
appropriate. The results obtained are not (unless otherwise specified) influenced by
preconceptions of the semantics of the data undergoing analysis. Patterns and trends may
therefore be revealed that may otherwise remain undetected, and/or not considered.
What is the difference between a DBMS and data mining?
A DBMS is a full-fledged system for housing and managing a set of digital databases. Data
mining, by contrast, is a technique or concept in computer science that deals with extracting
useful and previously unknown information from raw data. Most of the time, this raw data
is stored in very large databases. Data miners therefore use the existing functionality of a
DBMS to handle, manage, and even preprocess raw data before and during the data mining
process. A DBMS alone cannot be used to analyze data, although some DBMSs now have
built-in data analysis tools or capabilities.
Prepared By: Ujjal Bhowmik
3. Justify with an example: not all strong association rules are necessarily interesting.
Whether a rule is interesting or not can be assessed either subjectively or objectively.
Objective interestingness measures can be used as one step toward the goal of finding
interesting rules for the user.
Example of a misleading strong association rule:
Analyze the All Electronics transaction data about computer games and videos.
Of the 10,000 transactions analyzed:
6,000 of the transactions include computer games
7,500 of the transactions include videos
4,000 of the transactions include both
Suppose that min_sup = 30% and min_confidence = 60%.
The following association rule is discovered:
buys(X, computer games) => buys(X, videos) [support = 40%, confidence = 66%]
This rule is strong, but it is misleading.
The probability of purchasing videos is 75%, which is even larger than 66%.
In fact, computer games and videos are negatively associated, because the purchase of
one of these items actually decreases the likelihood of purchasing the other.
The confidence of a rule A => B can be deceiving:
--It is only an estimate of the conditional probability of itemset B given itemset A.
--It does not measure the real strength of the correlation and implication between A and B.
Correlation analysis is therefore needed.
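The arithmetic behind this example can be checked with a short Python sketch. The function name rule_measures and its parameter names are illustrative choices, not part of the original notes; the counts are the ones given in the example. A lift below 1 indicates the negative association described above.

```python
def rule_measures(n_total, n_a, n_b, n_both):
    """Return (support, confidence, lift) for the rule A => B from raw counts."""
    support = n_both / n_total                    # P(A and B)
    confidence = n_both / n_a                     # P(B | A)
    lift = support / ((n_a / n_total) * (n_b / n_total))
    return support, confidence, lift


# A = computer games, B = videos, counts taken from the example above.
support, confidence, lift = rule_measures(10_000, 6_000, 7_500, 4_000)
print(f"support={support:.2f} confidence={confidence:.3f} lift={lift:.3f}")
```

Here support is 0.40, confidence is about 0.667, and lift is about 0.889, so the rule passes both thresholds even though buying games makes buying videos less likely than the 75% baseline.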
4. Mr. Kamal Hossain, manager of Spicy Pickle, is interested in finding the correlation
between his two best-selling products, namely hamburgers and hot dogs. Mr. Kamal
analyzes his database and finds the following statistics about the two items.
                 Hot dogs   No hot dogs   Row total
Hamburgers         2,000         500        2,500
No hamburgers      1,000       1,500        2,500
Col total          3,000       2,000        5,000
i. Suppose that the association rule buys(X, hot dogs) => buys(X, hamburgers) is mined.
Given a minimum support threshold of 25% and a minimum confidence threshold of 50%, is
this association rule strong?
ii. How can you analyze the correlation between these two items using the lift, cosine, and
all_confidence measures?
SOLUTION NEEDED
From Note:
(a) Suppose that the association rule hot dogs => hamburgers is mined. Given a
minimum support threshold of 25% and a minimum confidence threshold of 50%, is
this association rule strong?
(b) Based on the given data, is the purchase of hot dogs independent of the purchase of
hamburgers? If not, what kind of correlation relationship exists between the two?
Answer:
(a) For the rule, support = 2000/5000 = 40%, and confidence = 2000/3000 = 66.7%. Both
values exceed their thresholds, so the association rule is strong.
(b) corr{hot dog, hamburger} = P({hot dog, hamburger}) / (P({hot dog}) × P({hamburger}))
= 0.4 / (0.6 × 0.5) = 1.33 > 1.
So, the purchase of hot dogs is not independent of the purchase of hamburgers. There exists
a positive correlation between the two.
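For part ii, the three measures can be computed from the same counts. The sketch below is only an illustration: the function name correlation_measures is made up for this example, and it assumes the usual textbook definitions, namely lift = P(A,B) / (P(A) P(B)), cosine = P(A,B) / sqrt(P(A) P(B)), and all_confidence = sup(A,B) / max(sup(A), sup(B)).

```python
from math import sqrt


def correlation_measures(n_total, n_a, n_b, n_both):
    """Lift, cosine and all_confidence for items A and B from raw counts."""
    p_a, p_b, p_ab = n_a / n_total, n_b / n_total, n_both / n_total
    return {
        "lift": p_ab / (p_a * p_b),
        "cosine": p_ab / sqrt(p_a * p_b),
        "all_confidence": n_both / max(n_a, n_b),
    }


# A = hot dogs (3,000), B = hamburgers (2,500), both = 2,000, total = 5,000.
measures = correlation_measures(5_000, 3_000, 2_500, 2_000)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

With these counts, lift is about 1.333, cosine about 0.730, and all_confidence about 0.667. Lift above 1 and cosine and all_confidence above 0.5 all point to a positive correlation.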
05. What do you mean by KDD? What are the steps pertaining to knowledge discovery
process?
The term Knowledge Discovery in Databases, or KDD for short, refers to the broad process
of finding knowledge in data, and emphasizes the "high-level" application of particular data
mining methods. The unifying goal of the KDD process is to extract knowledge from data
in the context of large databases.
Some people don't differentiate data mining from knowledge discovery, while others view
data mining as an essential step in the process of knowledge discovery. Here is the list of
steps involved in the knowledge discovery process:
Data Cleaning: In this step, noise and inconsistent data are removed.
Data Integration: In this step, multiple data sources are combined.
Data Selection: In this step, data relevant to the analysis task are retrieved from the
database.
Data Transformation: In this step, data are transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations.
Data Mining: In this step, intelligent methods are applied in order to extract data patterns.
Pattern Evaluation: In this step, data patterns are evaluated.
Knowledge Presentation: In this step, knowledge is represented.
The following diagram shows the process of knowledge discovery
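As a rough illustration of how the steps chain together, here is a toy Python sketch. Everything in it is hypothetical: the two sources store_a and store_b, the record layout, the helper names, and the threshold are invented for the example, and the "mining" step is just item counting.

```python
from collections import Counter

# Hypothetical raw records from two sources (invented for illustration).
store_a = [{"id": 1, "items": "milk, bread"}, {"id": 2, "items": None}]
store_b = [{"id": 3, "items": "milk, eggs"}, {"id": 4, "items": "Milk , bread"}]


def clean(records):                 # Data cleaning: drop records with missing items
    return [r for r in records if r["items"]]


def integrate(*sources):            # Data integration: combine the sources
    return [r for source in sources for r in source]


def select(records):                # Data selection: keep only the relevant field
    return [r["items"] for r in records]


def transform(item_strings):        # Data transformation: normalise into item sets
    return [{i.strip().lower() for i in s.split(",")} for s in item_strings]


def mine(baskets):                  # Data mining: count how often each item occurs
    return Counter(item for basket in baskets for item in basket)


def evaluate(counts, min_count):    # Pattern evaluation: keep items that are frequent enough
    return {item: c for item, c in counts.items() if c >= min_count}


def present(patterns):              # Knowledge presentation: readable report lines
    return [f"{item}: {count}" for item, count in sorted(patterns.items())]


records = integrate(clean(store_a), clean(store_b))
report = present(evaluate(mine(transform(select(records))), min_count=2))
print(report)
```

Each function corresponds to one step in the list, in the same order, so the final line reads as the whole process applied from the inside out.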
to find unknown patterns or rules of information that one can use to tailor business
operations. Data mining predicts future trends and behaviors, allowing businesses to make
proactive, knowledge-driven decisions.
Consider the following database containing five transactions. Let min_sup = 60%.
TID    Transaction
T100   a, c, d, f, g, i, m, p
T200   a, b, c, f, l, m, o
T300   b, f, h, j, o
T400   b, c, k, p, s
T500   a, c, e, f, l, m, n, p
Find out all the frequent itemsets using FP Growth Algorithm.
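A compact FP-growth sketch in Python can be used to check a hand-built solution for this database. It is a minimal illustration, not a reference implementation: the names Node, build_tree, and fp_growth are chosen for this example, and the 60% threshold is hard-coded as a count of 3 out of 5 transactions.

```python
from collections import defaultdict


class Node:
    """One FP-tree node: an item, a count, and links to parent and children."""

    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}


def build_tree(weighted_transactions, min_count):
    # First scan: count the support of every item.
    counts = defaultdict(int)
    for items, weight in weighted_transactions:
        for item in set(items):
            counts[item] += weight
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    # Second scan: insert the frequent items of each transaction, most frequent first.
    root = Node(None, None)
    header = defaultdict(list)          # item -> all tree nodes holding that item
    for items, weight in weighted_transactions:
        ordered = sorted((i for i in set(items) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += weight
            node = child
    return header, frequent


def fp_growth(weighted_transactions, min_count, suffix=frozenset()):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    header, frequent = build_tree(weighted_transactions, min_count)
    result = {}
    for item, support in frequent.items():
        itemset = suffix | {item}
        result[itemset] = support
        # Conditional pattern base: the prefix path above every node of this item.
        base = []
        for node in header[item]:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path, node.count))
        result.update(fp_growth(base, min_count, itemset))
    return result


database = ["acdfgimp", "abcflmo", "bfhjo", "bckps", "aceflmnp"]
frequent_itemsets = fp_growth([(list(t), 1) for t in database], min_count=3)
for itemset, count in sorted(frequent_itemsets.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), count)
```

The database is scanned twice while the initial tree is built; all later work happens on the smaller conditional pattern bases, which is why no candidate itemsets are generated.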
10. Mr. Abdul, owner of a supermarket, would like to find the frequent itemsets of the
products sold every day in his market. His employees maintain a database in which each
customer's purchases are recorded against the customer's identification number. The
database is illustrated below. Your task is to help him find all frequent itemsets with a
minimum support threshold of 60% using the FP-growth mining algorithm.
TID    Transaction
T100   M, O, N, K, E, Y
T200   D, O, N, K, E, Y
T300   M, A, K, E
T400   M, U, C, K, Y
T500   C, O, O, K, I, E
OR,
(a) Find all frequent itemsets using Apriori and FP-growth, respectively. Compare the
efficiency of the two mining processes.
Apriori:
FP-growth:
See below Figure for the FP-tree.
Efficiency comparison: Apriori has to scan the database once per iteration, while FP-growth
needs only two scans (one to count the items and one to build the FP-tree). Candidate
generation in Apriori is expensive (owing to the self-join), while FP-growth does not
generate any candidates.
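The Apriori side of the comparison can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the function name apriori is chosen for the example, the join is done by unioning pairs of frequent (k-1)-itemsets rather than by a prefix-based self-join, and the 60% threshold is passed as a count of 3 out of 5 transactions.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {frozenset(itemset): support count} using level-wise search."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        # One pass over the whole database for every candidate.
        return sum(1 for t in transactions if itemset <= t)

    items = {item for t in transactions for item in t}
    level = {frozenset([i]): support(frozenset([i])) for i in items}
    level = {c: s for c, s in level.items() if s >= min_count}
    frequent = dict(level)
    k = 2
    while level:
        previous = list(level)
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in previous for b in previous if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = {c: support(c) for c in candidates}
        level = {c: s for c, s in level.items() if s >= min_count}
        frequent.update(level)
        k += 1
    return frequent


market = ["MONKEY", "DONKEY", "MAKE", "MUCKY", "COOKIE"]
frequent_sets = apriori(market, min_count=3)
for itemset, count in sorted(frequent_sets.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), count)
```

For this database the sketch finds five frequent single items (M, O, K, E, Y), five frequent pairs (MK, OK, OE, KE, KY), and one frequent triple (OKE). The repeated calls to support make the per-level database scans visible.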
Consider the following database and find the support and confidence of the rule
{Milk, Diaper} => Beer
12. Suppose that, in a diagnostic center, the blood of 10 persons is tested and the amount
of calcium in their blood is found to be as follows: 2 mg/dl, 5 mg/dl, 6 mg/dl, 10 mg/dl,
20 mg/dl, 6 mg/dl, 18 mg/dl, 9 mg/dl, 5 mg/dl, 1 mg/dl. Find the persons whose blood
calcium is unusual using the quartile method.
SOLUTION NEEDED
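One possible way to work this out is sketched below in Python. It rests on two assumptions that the question does not state: quartiles are taken as the medians of the lower and upper halves of the sorted data, and a value is "unusual" when it lies more than 1.5 × IQR outside the quartiles. Other quartile conventions can give different fences, so the result should be read with that in mind. The function name quartile_outliers is illustrative.

```python
from statistics import median


def quartile_outliers(values, k=1.5):
    """Return (Q1, Q3, outliers), with outliers as (person number, value) pairs."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])                    # median of the lower half
    q3 = median(data[(len(data) + 1) // 2:])    # median of the upper half
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    outliers = [(person, v) for person, v in enumerate(values, start=1)
                if v < low_fence or v > high_fence]
    return q1, q3, outliers


calcium = [2, 5, 6, 10, 20, 6, 18, 9, 5, 1]     # mg/dl, in the order given
q1, q3, unusual = quartile_outliers(calcium)
print(f"Q1={q1} Q3={q3} unusual={unusual}")
```

Under these assumptions Q1 = 5 and Q3 = 10, so IQR = 5 and the fences are -2.5 and 17.5. The fifth person (20 mg/dl) and the seventh person (18 mg/dl) fall above the upper fence.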
13. State the advantages and disadvantages of the mining algorithms Apriori and
FP-Growth. Which mining algorithm do you think is better? Justify your answer.
Advantages of Apriori:
The Apriori algorithm is simple to implement and finds all frequent itemsets.
Disadvantages of Apriori:
Candidate generation can be extremely slow (pairs, triplets, etc.).
Candidate generation can produce duplicates, depending on the implementation.
The counting step iterates through all of the transactions on every pass.
Items that occur in almost every transaction make the algorithm much heavier.
Memory consumption is huge.
FP-Growth Biggest Advantages:
The biggest advantage of FP-Growth is that the algorithm only needs to read the file twice,
as opposed to Apriori, which reads it once for every iteration.
Another huge advantage is that it removes the need to calculate the pairs to be counted, which
is very processing-heavy, because it uses the FP-tree. Building the tree takes time linear in
the size of the database, which makes FP-Growth much faster than Apriori in practice.
The FP-Growth algorithm stores a compact version of the database in memory.
FP-Growth Bottlenecks:
The biggest problem is the interdependency of data: to parallelize the algorithm, some data
still needs to be shared, which creates a bottleneck in the shared memory.
Apriori vs FP-Growth:
Aspect            | Apriori | FP-Growth
Technique         | Generate singletons, pairs, triplets, etc. | Insert items, sorted by frequency, into a pattern tree.
Runtime           | Candidate generation is extremely slow. Runtime increases exponentially depending on the number of different items. | Runtime increases linearly, depending on the number of transactions and items.
Memory usage      | Saves singletons, pairs, triplets, etc. | Stores a compact version of the database.
Parallelizability | Candidate generation is very parallelizable. | Data are very interdependent; each node needs the root.
Conclusions:
FP-Growth beats Apriori by far. It has less memory usage and less runtime. The differences
are huge. FP-Growth is more scalable because of its linear running time.
Don't think twice if you have to make a decision between these algorithms. Use FP-Growth.
14. Bangladesh is a riverine country. Once upon a time, the best means of communication in
this country was by water. Bangladesh is now a developing country and has developed its
transportation system; many long bridges and culverts have been built over the last
decades. Although the people of Bangladesh no longer depend solely on waterways, the
people of the southern part of the country still prefer to travel by launch from one place
to another. Travelling by launch is very risky when the fog is dense, the water level is
high, and the current is strong. A database of different river conditions is given below.
Your task is to categorize the day when Fog = Dense, Depth = High, Current = 6 using the
naive decision tree method.
ID   Fog      Depth    Current   Status
1    Dense    Low      7         Risky
2    Sparse   Low      9         Risky
3    Medium   Medium   5         Safe
4    Dense    Medium   4         Safe
5    Sparse   High     3         Safe
6    Medium   High     12        Safe
7    Dense    High     4         Risky
8    Sparse   Medium   2         Safe
SOLUTION NEEDED
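One way to approach this is an ID3-style tree built with information gain. The Python sketch below is a simplification under a stated assumption: it uses only the two categorical attributes, Fog and Depth, and leaves the numeric Current attribute out, since handling it would require choosing split thresholds. The function names are illustrative, and this is not presented as the expected exam solution.

```python
from collections import Counter
from math import log2


def entropy(rows, target):
    counts = Counter(r[target] for r in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())


def info_gain(rows, attr, target):
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r)
    remainder = sum(len(g) / len(rows) * entropy(g, target) for g in groups.values())
    return entropy(rows, target) - remainder


def build_tree(rows, attrs, target):
    labels = Counter(r[target] for r in rows)
    majority = labels.most_common(1)[0][0]
    if len(labels) == 1 or not attrs:
        return majority                              # leaf: a class label
    best = max(attrs, key=lambda a: info_gain(rows, a, target))
    rest = [a for a in attrs if a != best]
    branches = {value: build_tree([r for r in rows if r[best] == value], rest, target)
                for value in {r[best] for r in rows}}
    return (best, branches, majority)                # internal node


def classify(tree, sample):
    while isinstance(tree, tuple):
        attr, branches, majority = tree
        tree = branches.get(sample[attr], majority)  # unseen value: fall back to majority
    return tree


columns = ("Fog", "Depth", "Status")
table = [("Dense", "Low", "Risky"), ("Sparse", "Low", "Risky"),
         ("Medium", "Medium", "Safe"), ("Dense", "Medium", "Safe"),
         ("Sparse", "High", "Safe"), ("Medium", "High", "Safe"),
         ("Dense", "High", "Risky"), ("Sparse", "Medium", "Safe")]
rows = [dict(zip(columns, r)) for r in table]

gain_fog = info_gain(rows, "Fog", "Status")
gain_depth = info_gain(rows, "Depth", "Status")
tree = build_tree(rows, ["Fog", "Depth"], "Status")
prediction = classify(tree, {"Fog": "Dense", "Depth": "High"})
print(f"gain(Fog)={gain_fog:.3f} gain(Depth)={gain_depth:.3f} -> {prediction}")
```

Depth has the higher information gain (about 0.610 against about 0.266 for Fog), so it becomes the root. Within the Depth = High branch, Fog separates the three remaining rows cleanly, and the day with dense fog and high depth is classified as Risky.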
15. Online banking is growing in popularity in Bangladesh because it makes transactions
fast. Suppose you are a banker and every day many people apply to you for a loan. You need
to categorize the applicants who apply online as either safe or risky. To do so, you are
given training data. Your task is to build a classifier and identify the class of a person
whose age is middle-aged and whose income is high. Let the class attribute of the training
data be Status.
ID   Age           Income   Status
1    Youth         Low      Risky
2    Youth         Low      Risky
3    Middle_aged   Medium   Safe
4    Middle_aged   Medium   Safe
5    Senior        High     Safe
6    Senior        High     Safe
SOLUTION NEEDED
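The question does not name a particular classifier, so the sketch below uses naive Bayes as one reasonable choice; that choice, the function names, and the absence of smoothing are all assumptions made for this illustration.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, features, target):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(r[target] for r in rows)
    value_counts = defaultdict(Counter)        # (feature, class) -> value counts
    for r in rows:
        for f in features:
            value_counts[(f, r[target])][r[f]] += 1
    return class_counts, value_counts


def predict(model, sample):
    """Return (best class, {class: P(class) * product of P(value | class)})."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total                      # prior P(class)
        for feature, value in sample.items():
            score *= value_counts[(feature, label)][value] / n   # no smoothing
        scores[label] = score
    return max(scores, key=scores.get), scores


columns = ("Age", "Income", "Status")
table = [("Youth", "Low", "Risky"), ("Youth", "Low", "Risky"),
         ("Middle_aged", "Medium", "Safe"), ("Middle_aged", "Medium", "Safe"),
         ("Senior", "High", "Safe"), ("Senior", "High", "Safe")]
rows = [dict(zip(columns, r)) for r in table]

model = train_naive_bayes(rows, ["Age", "Income"], "Status")
label, scores = predict(model, {"Age": "Middle_aged", "Income": "High"})
print(label, scores)
```

For the applicant in the question, the Safe score is 4/6 × 2/4 × 2/4 ≈ 0.167, while the Risky score is 0 because no risky applicant in the training data is middle-aged or has a high income. The applicant is therefore classified as Safe.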
17. Suppose that a data mining task is to cluster the following six points (with (x, y)
representing location): A1(4,6), A2(2,5), A3(9,3), A4(6,9), A5(7,5), A6(5,7). Apply the
divisive, k-means, agglomerative, and k-nearest-neighbors methods to group the above
data.
SOLUTION NEEDED
18. Suppose Jagannath University would like to form three football teams, named the Big,
Medium, and Small football teams, from all of its students based on their height.
Help the authority form the teams by writing an appropriate clustering
algorithm.
SOLUTION NEEDED
Supervised learning:
All data is labeled and the algorithms learn to predict the output from the input data.
Supervised learning is the data mining task of inferring a function from labeled training
data. The training data consist of a set of training examples. In supervised learning, each
example is a pair consisting of an input object (typically a vector) and a desired output value
(also called the supervisory signal). A supervised learning algorithm analyzes the training
data and produces an inferred function, which can be used for mapping new examples. An
optimal scenario will allow the algorithm to correctly determine the class labels
for unseen instances. This requires the learning algorithm to generalize from the training
data to unseen situations in a reasonable way.
Unsupervised learning:
All data is unlabeled and the algorithms learn the inherent structure from the input data.
In data mining, the problem of unsupervised learning is that of trying to find hidden
structure in unlabeled data. Since the examples given to the learner are unlabeled, there is no
error or reward signal to evaluate a potential solution.
The model is not provided with the correct results during training.
In short:
Supervised learning: learning from known labeled data to create a model, then predicting the
target class for the given input data.
Unsupervised learning: learning from unlabeled data to differentiate the given input data.
20. Suppose that a data mining task is to cluster the following six points (with (x, y)
representing location): A1(4,6), A2(2,5), A3(9,3), A4(6,9), A5(7,5), A6(5,7).
Suppose initially we assign A1, A2, and A3 as the seeds of the three clusters that we wish
to find. Use the k-means method to cluster the above data.
SOLUTION NEEDED
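A hand calculation for this question can be checked against the following k-means sketch in Python. It assumes Euclidean distance, which the question does not state explicitly, and the function name k_means is illustrative.

```python
from math import dist


def k_means(points, seeds, max_iter=100):
    """Cluster named 2-D points, starting from the points named in `seeds`."""
    centers = [points[name] for name in seeds]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_assignment = {
            name: min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            for name, p in points.items()
        }
        if new_assignment == assignment:       # no point moved: converged
            break
        assignment = new_assignment
        # Update step: each center becomes the mean of its members.
        for i in range(len(centers)):
            members = [points[n] for n, c in assignment.items() if c == i]
            if members:
                centers[i] = tuple(sum(axis) / len(members) for axis in zip(*members))
    clusters = [sorted(n for n, c in assignment.items() if c == i)
                for i in range(len(centers))]
    return clusters, centers


points = {"A1": (4, 6), "A2": (2, 5), "A3": (9, 3),
          "A4": (6, 9), "A5": (7, 5), "A6": (5, 7)}
clusters, centers = k_means(points, seeds=["A1", "A2", "A3"])
print(clusters)
print(centers)
```

After the first pass the clusters are {A1, A4, A6}, {A2}, and {A3, A5}, with centers (5, 7.33), (2, 5), and (8, 4). The second pass leaves every point where it is, so the algorithm stops.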
22. The following table shows the midterm and final exam grades obtained for students
in a database course.
Midterm   Final exam
72        84
50        63
81        77
74        78
94        90
86        75
59        49
83        79
65        77
33        52
88        74
81        90
I.
II.
SOLUTION NEEDED
23.
24.
Sequence
(1), (2, 3)
(1, 2, 3)
STEP-1:
Make the first pass over the sequence database D to yield all the 1-element frequent
sequences.
Candidate 1-sequences are: <{1}>, <{2}>, <{3}>
STEP- 2A:
Candidate Generation: Merge pairs of frequent subsequences found in the (k-1)th pass to
generate candidate sequences that contain k items
Candidate 1-sequences are: <{1}>, <{2}>, <{3}>
Base case (k=2): Merging two frequent 1-sequences <{i1}> and <{i2}> will produce two
candidate 2-sequences: <{i1}, {i2}> and <{i1, i2}>
Candidate      Support
<{1, 2}>
<{1, 3}>
<{2, 3}>
<{1}, {1}>
<{1}, {2}>
<{1}, {3}>
<{2}, {1}>
<{2}, {2}>
<{2}, {3}>
<{3}, {1}>
<{3}, {2}>
<{3}, {3}>
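The base case of candidate generation can be sketched in Python as follows. The representation is an assumption made for the example: a sequence is a tuple of elements, and each element is a sorted tuple of items. The function name gsp_candidates_2 is illustrative.

```python
from itertools import combinations, product


def gsp_candidates_2(frequent_items):
    """Build all candidate 2-sequences from the frequent 1-sequences."""
    items = sorted(frequent_items)
    # Both items in the same element, e.g. <{1, 2}>: item order does not matter.
    same_element = [(pair,) for pair in combinations(items, 2)]
    # Items in two consecutive elements, e.g. <{1}, {2}>: order matters, and an
    # item may follow itself, as in <{1}, {1}>.
    separate = [((a,), (b,)) for a, b in product(items, repeat=2)]
    return same_element + separate


candidates = gsp_candidates_2([1, 2, 3])
for sequence in candidates:
    print("<" + ", ".join("{" + ", ".join(map(str, e)) + "}" for e in sequence) + ">")
```

For the three frequent items 1, 2, and 3 this yields 3 + 9 = 12 candidates, matching the list above.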
Candidate
Support
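26. What is sequential pattern mining? How is it different from frequent itemset
mining?
Sequential pattern mining is a topic of data mining concerned with finding statistically
relevant patterns between data examples where the values are delivered in a sequence. It is
usually presumed that the values are discrete; time series mining is therefore closely related
but usually considered a different activity. Frequent itemset mining, by contrast, looks for
items that occur together within a transaction and ignores the order in which they occur,
whereas sequential pattern mining takes the order of the itemsets into account.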
27. Explain the self-joining technique used to generate candidate sequential patterns in
the GSP algorithm, with an example.
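As a starting point for an answer, the join rule commonly given for GSP is this: two frequent (k-1)-sequences s1 and s2 are joined if dropping the first item of s1 gives the same sequence as dropping the last item of s2; the candidate is s1 extended with the last item of s2, as a new element if that item formed its own element in s2, and otherwise inside the last element of s1. The Python sketch below illustrates that rule under the assumption that a sequence is a tuple of elements and each element is a sorted tuple of items. The helper names are made up for the example, and the special base case for joining 1-sequences is not covered.

```python
def drop_first(seq):
    """Remove the first item of the first element."""
    head = seq[0][1:]
    return ((head,) if head else ()) + seq[1:]


def drop_last(seq):
    """Remove the last item of the last element."""
    tail = seq[-1][:-1]
    return seq[:-1] + ((tail,) if tail else ())


def gsp_join(s1, s2):
    """Join two (k-1)-sequences into a k-sequence candidate, or return None."""
    if drop_first(s1) != drop_last(s2):
        return None
    last_item = s2[-1][-1]
    if len(s2[-1]) == 1:
        # The item was a separate element in s2: append it as a new element.
        return s1 + ((last_item,),)
    # Otherwise it joins the last element of s1.
    return s1[:-1] + (s1[-1] + (last_item,),)


# <{1, 2}, {3}> joined with <{2}, {3, 4}> and with <{2}, {3}, {5}>
print(gsp_join(((1, 2), (3,)), ((2,), (3, 4))))
print(gsp_join(((1, 2), (3,)), ((2,), (3,), (5,))))
print(gsp_join(((1, 2), (3,)), ((1,), (3, 4))))
```

The first join gives <{1, 2}, {3, 4}>, the second gives <{1, 2}, {3}, {5}>, and the third pair cannot be joined because the two shortened sequences differ.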