
UJJAL

Data Mining-CSE 5310

1. "Data mining discovers hidden knowledge." Do you agree? Justify your answer. What
are the differences between a database and data mining?
Yes, I agree that data mining discovers hidden knowledge.
It is generally accepted that the reason for capturing and storing large amounts of data is due
to the belief that there is valuable information implicitly coded within it. An important issue
is therefore how is this hidden information (if it exists at all) to be revealed? Traditional
methods of knowledge generation rely largely upon manual analysis and interpretation.
However, as data collections continue to grow in size and complexity, there is a
corresponding growing need for more sophisticated techniques of analysis. One such
innovative approach to the knowledge discovery process is known as data mining.
Data mining is essentially the computer-assisted process of information analysis. This can be
performed using either a top-down or a bottom-up approach. Bottom-up data mining analyses
raw data in an attempt to discover hidden trends and groups, whereas the aim of top-down
data mining is to test a specific hypothesis. Data mining may be performed using a variety of
techniques, including intelligent agents, powerful database queries, and multi-dimensional
analysis tools. Multi-dimensional analysis tools include the use of neural networks, as
described in this work.
The data mining approach expedites the initial stages of information analysis, thereby quickly
providing initial feedback that may be further and more thoroughly investigated if
appropriate. The results obtained are not (unless otherwise specified) influenced by
preconceptions of the semantics of the data undergoing analysis. Patterns and trends may
therefore be revealed that may otherwise remain undetected, and/or not considered.
What is the difference between DBMS and Data mining?
A DBMS is a full-fledged system for housing and managing a set of digital databases. Data
mining, in contrast, is a technique or concept in computer science that deals with extracting
useful and previously unknown information from raw data. Most of the time, this raw data
is stored in very large databases. Data miners therefore use the existing functionality of a
DBMS to handle, manage, and even preprocess raw data before and during the data mining
process. A DBMS alone cannot be used to analyze data in this way, although some DBMSs
now have built-in data analysis tools or capabilities.

Prepared By: Ujjal Bhowmik


2. What are the various applications of data mining?


Web page analysis: from web page classification, clustering to PageRank & HITS
algorithms
Collaborative analysis & recommender systems
Basket data analysis to targeted marketing
Biological and medical data analysis: classification, cluster analysis, biological sequence
analysis, biological network analysis
Data mining and software engineering (e.g., IEEE Computer, Aug. 2009 issue)
From major dedicated data mining systems/tools (e.g., SAS, MS SQL-Server Analysis
Manager, Oracle Data Mining Tools) to invisible data mining
Here is the list of areas where data mining is widely used:
Financial Data Analysis, Retail Industry, Telecommunication Industry, Biological Data
Analysis, Other Scientific Applications, Intrusion Detection.
For businesses, data mining is used to discover patterns and relationships in the data in order
to help make better business decisions. Data mining can help spot sales trends, develop
smarter marketing campaigns, and accurately predict customer loyalty. Specific uses of data
mining include:
>>Market segmentation - Identify the common characteristics of customers who buy the
same products from your company.
>>Customer churn - Predict which customers are likely to leave your company and go to a
competitor.
>>Fraud detection - Identify which transactions are most likely to be fraudulent.
>>Direct marketing - Identify which prospects should be included in a mailing list to obtain
the highest response rate.
>>Interactive marketing - Predict what each individual accessing a Web site is most likely
interested in seeing.
>>Market basket analysis - Understand what products or services are commonly purchased
together; e.g., beer and diapers.
>>Trend analysis - Reveal the difference between a typical customer this month and last.


3. Justify with an example: "Not all strong association rules are necessarily interesting."
Whether a rule is interesting or not can be assessed either subjectively or objectively.
Objective interestingness measures can be used as one step toward the goal of finding
interesting rules for the user.
Example of a misleading strong association rule:
Analyze the AllElectronics transactions involving computer games and videos.
Of the 10,000 transactions analyzed:
6,000 of the transactions include computer games
7,500 of the transactions include videos
4,000 of the transactions include both
Suppose that min_sup = 30% and min_confidence = 60%.
The following association rule is discovered:
buys(X, computer games) ⇒ buys(X, videos) [support = 40%, confidence = 66%]
This rule is strong, but it is misleading.
The probability of purchasing videos is 75%, which is even larger than 66%.
In fact, computer games and videos are negatively associated, because the purchase of
one of these items actually decreases the likelihood of purchasing the other.
The confidence of a rule A ⇒ B can be deceiving:
--It is only an estimate of the conditional probability of itemset B given itemset A.
--It does not measure the real strength of the correlation and implication between A and B.
A correlation measure such as lift is therefore needed in addition to support and confidence.
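
The arithmetic behind this example can be checked with a short Python sketch. It uses only the counts quoted above; the variable names are illustrative, and lift is used here as one possible correlation measure.

```python
# Counts taken from the computer games / videos example above.
n_total = 10_000
n_games = 6_000
n_videos = 7_500
n_both = 4_000

support = n_both / n_total        # P(games and videos)
confidence = n_both / n_games     # P(videos | games)
p_games = n_games / n_total       # P(games)
p_videos = n_videos / n_total     # P(videos)

# Lift below 1 means the two items occur together less often than
# they would if they were independent (negative correlation).
lift = support / (p_games * p_videos)

print(f"support={support:.1%} confidence={confidence:.1%} "
      f"P(videos)={p_videos:.1%} lift={lift:.3f}")
```

The rule passes both thresholds, yet its confidence is lower than the unconditional probability of buying videos, which is why the lift comes out below 1.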


4. Mr. Kamal Hossain, manager of Spicy Pickle, wants to find the correlation
between his two best-selling products, hamburgers and hot dogs. Mr. Kamal
analyzes his database and finds the following statistics about the two items.

                  Hot dogs   Not hot dogs   Row total
Hamburgers           2,000            500       2,500
Not hamburgers       1,000          1,500       2,500
Column total         3,000          2,000       5,000

i. Suppose that the association rule buys(X, hot dogs) ⇒ buys(X, hamburgers) is mined.
Given a minimum support threshold of 25% and a minimum confidence threshold of 50%, is
this association rule strong?
ii. How can you analyze the correlation between these two items using the lift, cosine, and
all_confidence measures?

SOLUTION NEEDED
From Note:
(a) Suppose that the association rule hot dogs ⇒ hamburgers is mined. Given a
minimum support threshold of 25% and a minimum confidence threshold of 50%, is
this association rule strong?
(b) Based on the given data, is the purchase of hot dogs independent of the purchase of
hamburgers? If not, what kind of correlation relationship exists between the two?
Answer:
(a) For the rule, support = 2000/5000 = 40%, and confidence = 2000/3000 = 66.7%. Both
values exceed the thresholds, so the association rule is strong.
(b) corr{hot dog, hamburger} = P({hot dog, hamburger}) / (P({hot dog}) × P({hamburger}))
= 0.4 / (0.6 × 0.5) = 1.33 > 1.
So, the purchase of hot dogs is not independent of the purchase of hamburgers. There exists
a positive correlation between the two.
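
For part ii, the three measures can be computed directly from the contingency counts. The sketch below assumes the usual textbook definitions: lift = P(A,B) / (P(A) P(B)), cosine = P(A,B) / sqrt(P(A) P(B)), and all_confidence = sup(A,B) / max(sup(A), sup(B)). The variable names are illustrative.

```python
from math import sqrt

# Counts from the hamburger / hot dog table in this question.
n_total = 5_000
n_hotdogs = 3_000
n_hamburgers = 2_500
n_both = 2_000

p_hotdogs = n_hotdogs / n_total
p_hamburgers = n_hamburgers / n_total
p_both = n_both / n_total

lift = p_both / (p_hotdogs * p_hamburgers)
cosine = p_both / sqrt(p_hotdogs * p_hamburgers)
all_confidence = n_both / max(n_hotdogs, n_hamburgers)

print(f"lift={lift:.3f} cosine={cosine:.3f} all_confidence={all_confidence:.3f}")
```

A lift above 1, together with cosine and all_confidence values above 0.5, points to a positive correlation between the two items.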


05. What do you mean by KDD? What are the steps pertaining to knowledge discovery
process?
The term Knowledge Discovery in Databases, or KDD for short, refers to the broad process
of finding knowledge in data, and emphasizes the "high-level" application of particular data
mining methods. The unifying goal of the KDD process is to extract knowledge from data
in the context of large databases.
Some people don't differentiate data mining from knowledge discovery, while others view
data mining as an essential step in the process of knowledge discovery. Here is the list of
steps involved in the knowledge discovery process:
Data Cleaning: In this step, noise and inconsistent data are removed.
Data Integration: In this step, multiple data sources are combined.
Data Selection: In this step, data relevant to the analysis task are retrieved from the
database.
Data Transformation: In this step, data is transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations.
Data Mining: In this step, intelligent methods are applied in order to extract data patterns.
Pattern Evaluation: In this step, data patterns are evaluated.
Knowledge Presentation: In this step, the mined knowledge is presented to the user.
The following diagram shows the process of knowledge discovery


6. What is data warehousing? Explain the data warehousing architecture.


Data warehousing:
A data warehouse is a repository (or archive) of information gathered from multiple sources,
stored under a unified schema, at a single site.
In computing, a data warehouse (DW or DWH), also known as an enterprise data
warehouse (EDW), is a system used for reporting and data analysis, and is considered a core
component of business intelligence.
Data Warehouse Architecture:
The following are the components of a data warehouse:
Database Servers: Operational data accumulated during standard business must be
extracted and stored into a database. Most companies use a relational database stored on a
mainframe server.
Queries/Reports: Querying is a broad term that encompasses all the activities of requesting
data from a data warehouse for analysis. Reports are then generated to display the results for
the specified query. Querying, obviously, is the whole point of using the data warehouse.

OLAP/Multi-dimensional analysis: Relational databases store data in a two-dimensional
format: tables of data represented by rows and columns. Multi-dimensional analysis,
commonly referred to as On-Line Analytical Processing (OLAP), offers an extension to the
relational model to provide a multi-dimensional view of the data. These tools allow users to
drill down from summary data sets into the specific data underlying the summaries. Statistical
analysis tools provide summary information and help determine the degree of relationship
between two factors.
Data Mining: Data mining is the process of analyzing business data in the data warehouse
to find unknown patterns or rules of information that one can use to tailor business
operations. Data mining predicts future trends and behaviors, allowing businesses to make
proactive, knowledge-driven decisions.


7. Present an example where data mining is crucial to the success of a business.


What data mining functions does this business need?
Can they be performed alternatively by data query processing or simple statistical
analysis?
A department store, for example, can use data mining to assist with its target marketing
mail campaign.
Using data mining functions such as association, the store can use the mined strong
association rules to determine which products bought by one group of customers are likely to
lead to the buying of certain other products. With this information, the store can then mail
marketing materials only to those kinds of customers who exhibit a high likelihood of
purchasing additional products.
Data query processing is used for data or information retrieval and does not have the means
for finding association rules. Similarly, simple statistical analysis cannot handle large
amounts of data such as those of customer records in a department store.


8. What is market basket analysis?


What are the objectives of market basket analysis?
For what purpose do we use Apriori algorithm?
Market Basket Analysis:
Market Basket Analysis (MBA), also known as affinity analysis, is a technique to identify
items likely to be purchased together.
Market Basket Analysis is a modeling technique based upon the theory that if you buy a
certain group of items, you are more (or less) likely to buy another group of items. For
example, if you are in an English pub and you buy a pint of beer and don't buy a bar meal,
you are more likely to buy crisps (US. chips) at the same time than somebody who didn't buy
beer.
The set of items a customer buys is referred to as an itemset, and market basket analysis seeks
to find relationships between purchases.
Goals and Objectives of Market Basket Analysis:
Analysis of transaction-level data provides retailers insight that can drive important
merchandising and pricing decisions. First, the Market Basket Analysis provides insight into
the relationships that exist between product groups. This information can assist in steering
product placement and promotional decisions. By understanding affinities and
cannibalization for these decisions, forecasts are more accurate by providing a holistic view
of the impact of price and promotional decisions.
Understanding basket-level dynamics allows retailers to make better decisions related to base
and promotional pricing enabling them to:
Improve cross-selling opportunities across categories
Up-sell to better or more profitable brands within purchased categories
Drive store traffic with the right offer and incentive
Improve sales with in-store displays by co-locating the right items
Understand the holistic impact of promotions and price changes
Improve performance of multiple-purchase offers (e.g. 2 for $2)
Market Basket Analysis empowers marketing and sales organizations to make better,
informed decisions about how and where to deploy their efforts and resources. More so,
strategic action plans can be developed and deployed that align resources around these
insights to increase sales and profitability.
The primary objective of Market Basket Analysis is to improve the effectiveness of
marketing and sales tactics using customer data collected during the sales transaction. It can
also be used to optimize and facilitate business operations, particularly with regard to
inventory control and channel optimization.
What is the use of Apriori algorithm?
The Apriori algorithm is an influential algorithm for mining frequent itemsets for
Boolean association rules. Apriori uses a "bottom-up" approach, where frequent subsets are
extended one item at a time (a step known as candidate generation), and groups of candidates
are tested against the data.


9. What is frequent itemset?


Intuitively, a set of items that appears in many baskets is said to be frequent. To be formal,
we assume there is a number s, called the support threshold. If I is a set of items, the support
for I is the number of baskets for which I is a subset. We say I is frequent if its support is s or
more.
Frequent Itemset is an itemset whose support is greater than or equal to a minsup threshold.

Consider the following database containing five transactions. Let min_sup = 60%.
TID    Transaction
T100   a, c, d, f, g, i, m, p
T200   a, b, c, f, l, m, o
T300   b, f, h, j, o
T400   b, c, k, p, s
T500   a, c, e, f, l, m, n, p
Find out all the frequent itemsets using FP Growth Algorithm.

Answer Given On: Lecture sheet no (13-14).
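
As a cross-check for the lecture-sheet answer, here is a minimal FP-growth sketch in Python for the five transactions above. It is one possible implementation, not the lecture's own: the class `Node` and the functions `build_tree` and `fp_growth` are names chosen for this illustration, and the 60% threshold is taken to mean at least 3 of the 5 transactions.

```python
from collections import defaultdict

class Node:
    """One FP-tree node: an item, a count, and links to parent and children."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_tree(weighted, min_count):
    """Build an FP-tree from (items, count) pairs; return header table and item counts."""
    freq = defaultdict(int)
    for items, cnt in weighted:               # first pass: count items
        for item in set(items):
            freq[item] += cnt
    freq = {i: c for i, c in freq.items() if c >= min_count}
    root, header = Node(None, None), defaultdict(list)
    for items, cnt in weighted:               # second pass: insert sorted items
        ordered = sorted((i for i in set(items) if i in freq),
                         key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += cnt
            node = child
    return header, freq

def fp_growth(weighted, min_count, suffix=frozenset()):
    """Return {itemset: support count} for all frequent itemsets."""
    header, freq = build_tree(weighted, min_count)
    patterns = {}
    for item, nodes in header.items():
        pattern = suffix | {item}
        patterns[pattern] = freq[item]
        base = []                             # conditional pattern base of `item`
        for node in nodes:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path, node.count))
        patterns.update(fp_growth(base, min_count, pattern))
    return patterns

transactions = ["acdfgimp", "abcflmo", "bfhjo", "bckps", "acefglmnp".replace("g", "")]
min_count = 3  # assumption: 60% of 5 transactions
patterns = fp_growth([(list(t), 1) for t in transactions], min_count)
for itemset, count in sorted(patterns.items(), key=lambda p: (len(p[0]), sorted(p[0]))):
    print("".join(sorted(itemset)), count)
```

No candidate itemsets are generated: each recursive call mines a smaller conditional tree built from the prefix paths of one item.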


10. Mr. Abdul, owner of a supermarket, would like to find the frequent itemsets of the products
sold every day in his market. His employees maintain a database where each
customer's buying information is kept against the customer's identification number. The
database is illustrated below. Your task is to help him find all frequent itemsets with a
minimum support threshold of 60% using the FP-growth mining algorithm.
TID Transaction
T100 M, O,N,K,E,Y
T200 D,O,N,K,E,Y
T300 M,A,K,E
T400 M,U,C,K,Y
T500 C,O,O,K,I,E
OR,
(a) Find all frequent itemsets using Apriori and FP-growth, respectively. Compare the
efficiency of the two mining processes.
Apriori:

FP-growth:
See below Figure for the FP-tree.

Efficiency comparison: Apriori has to scan the database once for every level of candidates,
while FP-growth needs only two scans (one to count item frequencies and one to build the
FP-tree). Candidate generation in Apriori is expensive (owing to the self-join), while
FP-growth does not generate any candidates.
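
The level-wise Apriori search for the five transactions of this question can be sketched as follows. This is a minimal illustration rather than a tuned implementation: the function name `apriori` is chosen here, the 60% threshold is taken to mean at least 3 of 5 transactions, and the repeated O in T500 is counted once.

```python
from itertools import combinations

# Transactions T100..T500; each string is turned into a set of items.
transactions = [set("MONKEY"), set("DONKEY"), set("MAKE"), set("MUCKY"), set("COOKIE")]
min_count = 3  # assumption: 60% of 5 transactions

def apriori(transactions, min_count):
    """Return {itemset: support count} for all frequent itemsets."""
    def count(candidates):
        # One scan of the database per level.
        return {c: sum(c <= t for t in transactions) for c in candidates}

    items = {item for t in transactions for item in t}
    level = {c: n for c, n in count({frozenset([i]) for i in items}).items()
             if n >= min_count}
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = {c: n for c, n in count(candidates).items() if n >= min_count}
        frequent.update(level)
        k += 1
    return frequent

frequent = apriori(transactions, min_count)
for itemset, n in sorted(frequent.items(), key=lambda p: (len(p[0]), sorted(p[0]))):
    print("".join(sorted(itemset)), n)
```

Each pass through the loop corresponds to one database scan, which is where the cost difference from FP-growth comes from.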


11. What is support and confidence?


Support, Confidence and Lift:
There are several measures used to understand various aspects of associated products. Let's
understand the measures with the help of an example. In a store, there are 1000 transactions
overall. Item A appears in 80 transactions and Item B occurs in 100 transactions. Items A and
B appear in 20 transactions together.
Support is the ratio of the number of transactions containing an item (or a set of items
occurring together) to the total number of transactions. Support of A = Pr(A) = 80/1000 = 8%,
support of B = Pr(B) = 100/1000 = 10%, and support of {A, B} = Pr(A,B) = 20/1000 = 2%.
Confidence is the conditional probability that a randomly selected transaction will include Item
A given that it includes Item B. Confidence of B ⇒ A = Pr(A|B) = 20/100 = 20%.
Lift can be expressed as the ratio of the probability of Items A and B occurring together to
the product of the two individual probabilities for Item A and Item B. Lift = Pr(A,B) /
(Pr(A) × Pr(B)) = (20/1000) / ((80/1000) × (100/1000)) = 2.5.

Consider the following database and find out the support and confidence of
{Milk, Diaper} ⇒ Beer
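
The database for this exercise is not reproduced in these notes, so the sketch below uses five hypothetical transactions purely to illustrate how support and confidence of a rule are computed from raw data. The transactions and the function name `rule_metrics` are assumptions, not the original exercise data.

```python
# Hypothetical market-basket data (NOT the table from the original exercise).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(antecedent <= t for t in transactions)
    n_both = sum((antecedent | consequent) <= t for t in transactions)
    support = n_both / len(transactions)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

support, confidence = rule_metrics(transactions, {"Milk", "Diaper"}, {"Beer"})
print(f"support={support:.2f} confidence={confidence:.2f}")
```

With these assumed transactions, the antecedent appears in three baskets and the full rule in two, so support is 2/5 and confidence is 2/3.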


12. Suppose, in a diagnostic center, blood test of 10 persons is accomplished, the amount
of calcium in their blood is found as follows 2mg/dl, 5 mg/dl, 6 mg/dl, 10mg/dl, 20mg/dl,
6mg/dl, 18mg/dl, 9mg/dl, 5mg/dl, 1mg/dl. Now find out those persons whose blood
calcium is unusual using Quartile method.

SOLUTION NEEDED
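
One possible worked solution, sketched in Python. It assumes the common convention that Q1 and Q3 are the medians of the lower and upper halves of the sorted data and that a value is unusual when it lies more than 1.5 × IQR outside the quartiles. Other quartile conventions exist and can give different fences, so treat the result as dependent on that assumption.

```python
from statistics import median

calcium = [2, 5, 6, 10, 20, 6, 18, 9, 5, 1]  # mg/dl, persons 1..10 in the order given

data = sorted(calcium)
half = len(data) // 2
q1 = median(data[:half])       # median of the lower half
q3 = median(data[-half:])      # median of the upper half
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

# Persons (1-based position in the original list) outside the fences.
unusual = [(i + 1, x) for i, x in enumerate(calcium)
           if x < low_fence or x > high_fence]

print(f"Q1={q1} Q3={q3} IQR={iqr} fences=({low_fence}, {high_fence})")
print("unusual (person, value):", unusual)
```

Under this convention Q1 = 5, Q3 = 10 and IQR = 5, so only the readings of 18 mg/dl and 20 mg/dl fall outside the fences.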


13. State the advantages and disadvantages of the mining algorithms (Apriori and FP-Growth).
Which mining algorithm is better? Justify your answer.
Advantages of Apriori:
The Apriori Algorithm calculates more sets of frequent items.
Disadvantages of Apriori:
The candidate generation could be extremely slow (pairs, triplets, etc.).
The candidate generation could generate duplicates depending on the implementation.
The counting method iterates through all of the transactions each time.
Constant items make the algorithm a lot heavier.
Huge memory consumption.
FP-Growth Biggest Advantages:
The biggest advantage of FP-Growth is that the algorithm only needs to read
the file twice, as opposed to Apriori, which reads it once for every iteration.
Another huge advantage is that it removes the need to calculate the pairs to be counted, which
is very processing heavy, because it uses the FP-tree. Its running time then grows roughly
linearly with the size of the data, which is much faster than Apriori.
The FP-Growth algorithm stores in memory a compact version of the database.
FP-Growth Bottlenecks:
The biggest problem is the interdependency of data. The interdependency problem is that, to
parallelize the algorithm, some data still needs to be shared, which creates a
bottleneck in the shared memory.
Apriori vs FP-Growth:

Algorithm: Apriori
Technique: Generate singletons, pairs, triplets, etc.
Runtime: Candidate generation is extremely slow. Runtime increases exponentially
depending on the number of different items.
Memory usage: Saves singletons, pairs, triplets, etc.
Parallelizability: Candidate generation is very parallelizable.

Algorithm: FP-Growth
Technique: Insert sorted items by frequency into a pattern tree.
Runtime: Runtime increases linearly, depending on the number of transactions and items.
Memory usage: Stores a compact version of the database.
Parallelizability: Data are very interdependent; each node needs the root.

Conclusions:
FP-Growth beats Apriori by far. It has less memory usage and less runtime. The differences
are huge. FP-Growth is more scalable because of its linear running time.
Don't think twice if you have to make a decision between these algorithms. Use FP-Growth.


14. Bangladesh is a riverine country. Once upon a time, the best means of communication in
this country was by water. But now Bangladesh is a developing country and it
has developed its transportation system. Many long bridges and culverts have been built
over the last decades. Although the people of Bangladesh no longer depend solely on
waterways, the people of the southern part of the country still prefer to travel by launch
from one place to another. Traveling by launch is very risky when the fog is dense,
the water level is high, and the current is strong. A database of different
river conditions is given below. Your task is to categorize the day when Fog = Dense,
Depth = High, Current = 6 using the naive decision tree method.

ID   Fog      Depth    Current   Status
1    Dense    Low      7         Risky
2    Sparse   Low      9         Risky
3    Medium   Medium   5         Safe
4    Dense    Medium   4         Safe
5    Sparse   High     3         Safe
6    Medium   High     12        Safe
7    Dense    High     4         Risky
8    Sparse   Medium   2         Safe

SOLUTION NEEDED


15. Online banking is growing popular in Bangladesh because it makes transactions fast. Suppose
you are a banker and every day many people apply to you for loans. You need to
categorize the applicants who apply for a loan online as safe or risky.
To do this, you are given training data. Your task is to build a classifier
and identify the class of a person whose age is medium and income is high. Let the
class attribute of the training data be Status.

ID   Age           Income   Status
1    Youth         Low      Risky
2    Youth         Low      Risky
3    Middle_aged   Medium   Safe
4    Middle_aged   Medium   Safe
5    Senior        High     Safe
6    Senior        High     Safe

SOLUTION NEEDED
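
The question does not name a classifier, so the following is only one possible approach: a naive Bayes classifier with add-one (Laplace) smoothing, sketched in Python. Two assumptions are made here: "age is medium" is read as the value Middle_aged, and the function name `naive_bayes_scores` is chosen for this illustration.

```python
from collections import Counter

# Training data from the table above: (Age, Income, Status).
training = [
    ("Youth", "Low", "Risky"),
    ("Youth", "Low", "Risky"),
    ("Middle_aged", "Medium", "Safe"),
    ("Middle_aged", "Medium", "Safe"),
    ("Senior", "High", "Safe"),
    ("Senior", "High", "Safe"),
]

def naive_bayes_scores(training, age, income):
    """Return an unnormalized naive Bayes score for each class."""
    class_counts = Counter(status for _, _, status in training)
    n_ages = len({a for a, _, _ in training})       # distinct Age values
    n_incomes = len({i for _, i, _ in training})    # distinct Income values
    scores = {}
    for status, n in class_counts.items():
        age_hits = sum(1 for a, _, s in training if s == status and a == age)
        income_hits = sum(1 for _, i, s in training if s == status and i == income)
        prior = n / len(training)
        # Add-one smoothing avoids zero probabilities for unseen combinations.
        p_age = (age_hits + 1) / (n + n_ages)
        p_income = (income_hits + 1) / (n + n_incomes)
        scores[status] = prior * p_age * p_income
    return scores

scores = naive_bayes_scores(training, age="Middle_aged", income="High")
prediction = max(scores, key=scores.get)
print(scores, "->", prediction)
```

Under these assumptions the Safe class receives the higher score, because both Middle_aged and High occur only among safe applicants in the training data.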


16. Distinguish between classification and clustering?


Clustering: Clustering is an unsupervised learning technique used to group similar instances
on the basis of features.
Classification: Classification is a supervised learning technique used to assign predefined
tags to instances on the basis of features.


17. Suppose that a data mining task is to cluster the following six points (with (x, y)
representing location): A1(4,6), A2(2,5), A3(9,3), A4(6,9), A5(7,5), A6(5,7). Apply the
divisive, K-means, agglomerative, and K-nearest-neighbors methods to classify the above
data.

SOLUTION NEEDED

18. Suppose Jagannath University would like to form three football teams, named Big,
Medium, and Small, from all the students studying there, based on their height.
Help the authority form the teams by writing an appropriate clustering
algorithm.

SOLUTION NEEDED


19. What do you mean by supervised learner and unsupervised learner?

Supervised learning:

All data is labeled and the algorithms learn to predict the output from the input data.
Supervised learning is the data mining task of inferring a function from labeled training
data. The training data consist of a set of training examples. In supervised learning, each
example is a pair consisting of an input object (typically a vector) and a desired output value
(also called the supervisory signal). A supervised learning algorithm analyzes the training
data and produces an inferred function, which can be used for mapping new examples. An
optimal scenario will allow for the algorithm to correctly determine the class labels
for unseen instances. This requires the learning algorithm to generalize from the training
data to unseen situations in a reasonable way.

In short, supervised learning learns from known, labeled data to create a model and then predicts
the target class for given input data, while unsupervised learning learns from unlabeled data to
differentiate the given input data.

Unsupervised learning:

All data is unlabeled and the algorithms learn the inherent structure from the input data.
In data mining, the problem of unsupervised learning is that of trying to find hidden
structure in unlabeled data. Since the examples given to the learner are unlabeled, there is no
error or reward signal to evaluate a potential solution.
The model is not provided with the correct results during training.


20. Suppose that a data mining task is to cluster the following six points (with (x, y)
representing location): A1(4,6), A2(2,5), A3(9,3), A4(6,9), A5(7,5), A6(5,7).
Suppose initially we assign A1, A2 and A3 as the seeds of the three clusters that we wish to
find. Use the K-means method to classify the above data.

SOLUTION NEEDED
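
One possible worked solution, sketched in Python. It assumes Euclidean distance and takes A1, A2 and A3 as the three seeds; the function name `k_means` is chosen for this illustration.

```python
from math import dist

points = {"A1": (4, 6), "A2": (2, 5), "A3": (9, 3),
          "A4": (6, 9), "A5": (7, 5), "A6": (5, 7)}
seeds = [points["A1"], points["A2"], points["A3"]]  # assumed initial centers

def k_means(points, seeds, max_iter=100):
    """Plain K-means: assign to the nearest center, recompute means, repeat."""
    centers = list(seeds)
    assignment = {}
    for _ in range(max_iter):
        new_assignment = {
            name: min(range(len(centers)), key=lambda c: dist(p, centers[c]))
            for name, p in points.items()
        }
        if new_assignment == assignment:   # no point changed cluster: converged
            break
        assignment = new_assignment
        for c in range(len(centers)):
            members = [points[n] for n, a in assignment.items() if a == c]
            if members:
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centers

assignment, centers = k_means(points, seeds)
for c, center in enumerate(centers):
    members = [n for n, a in assignment.items() if a == c]
    print(f"cluster {c + 1}: {members} center={center}")
```

With these seeds the assignment stabilizes after the first update: {A1, A4, A6} with center (5, 7.33), {A2} with center (2, 5), and {A3, A5} with center (8, 4).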


21. Distinguish between supervised learning and unsupervised learning.


In which type of learning does the clustering fall?
In supervised learning, labeled output data are provided and used to train the machine
to produce the desired outputs, whereas in unsupervised learning no output labels are provided;
instead, the data is clustered into different classes.
Supervised learning
1) A human builds a classifier based on input and output data
2) That classifier is trained with a training set of data
3) That classifier is tested with a test set of data
4) Deployment if the output is satisfactory
Unsupervised learning
1) A human builds an algorithm based on input data
2) That algorithm is tested with a test set of data (in which the algorithm creates the
classifier)
3) Deployment if the classifier is satisfactory.

N.B.: Clustering falls into the unsupervised learning category.


22. The following table shows the midterm and final exam grades obtained for students
in a database course.
Midterm exam (x)   Final exam (y)
72                 84
50                 63
81                 77
74                 78
94                 90
86                 75
59                 49
83                 79
65                 77
33                 52
88                 74
81                 90

I. Plot the data. Do x and y seem to have a linear relationship?
II. Use the method of least squares to find an equation for the prediction of a
student's final exam grade based on the student's midterm grade in the course.
III. Predict the final exam grade of a student who received an 86 on the midterm exam.
Answer:
(a) Plot the data. Do x and y seem to have a linear relationship?
Yes, from the scatter graph, it would appear that x and y have a linear relationship.
(b) Use the method of least squares to find an equation for the prediction of a student's final
exam grade based on the student's midterm grade in the course.
|D| = 12; mean of x = 866/12 = 72.167; mean of y = 888/12 = 74. Using Equations (6.50) and
(6.51), we get w1 = 0.5816 and w0 = 32.028. Therefore, the equation for predicting a student's
final exam grade based on the student's midterm grade is y = 32.028 + 0.5816x.
(c) Predict the final exam grade of a student who received an 86 on the midterm exam.
Using the formula from part (b), we get y = 32.028 + (0.5816)(86) = 82.045. Therefore, we
would predict that a student who received an 86 on the midterm would get 82 on the final
exam.
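
The coefficients above can be reproduced with a short Python sketch. It assumes the standard least-squares formulas w1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and w0 = ȳ − w1·x̄; the variable names are illustrative.

```python
# Midterm (x) and final exam (y) grades from the table in this question.
midterm = [72, 50, 81, 74, 94, 86, 59, 83, 65, 33, 88, 81]
final = [84, 63, 77, 78, 90, 75, 49, 79, 77, 52, 74, 90]

n = len(midterm)
x_mean = sum(midterm) / n
y_mean = sum(final) / n

# Slope and intercept of the least-squares regression line y = w0 + w1 * x.
w1 = (sum((x - x_mean) * (y - y_mean) for x, y in zip(midterm, final))
      / sum((x - x_mean) ** 2 for x in midterm))
w0 = y_mean - w1 * x_mean

prediction = w0 + w1 * 86
print(f"w1={w1:.4f} w0={w0:.3f} predicted final for midterm 86: {prediction:.3f}")
```

The sketch agrees with the hand calculation: a slope of about 0.5816, an intercept of about 32.028, and a predicted final grade of about 82.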

SOLUTION NEEDED
23.
24.


25. Find frequent sequential pattern using GSP algorithm.


Suppose now we have 3 items: 1, 2, 3, and let min-support be 50%. The sequence database is
shown in following table:
Object   Sequence
         (1), (2), (3)
         (1, 2), (3)
         (1), (2, 3)
         (1, 2, 3)
         (1, 2), (2, 3), (1, 3)


SOLUTION

STEP-1:
Make the first pass over the sequence database D to yield all the 1-element frequent
sequences.
Candidate 1-sequences are: <{1}>, <{2}>, <{3}>

STEP- 2A:
Candidate Generation: Merge pairs of frequent subsequences found in the (k-1)th pass to
generate candidate sequences that contain k items.
Frequent 1-sequences are: <{1}>, <{2}>, <{3}>
Base case (k=2): Merging two frequent 1-sequences <{i1}> and <{i2}> will produce two
candidate 2-sequences: <{i1}, {i2}> and <{i1, i2}>


Candidate 2-sequences are:


<{1, 2}>, <{1, 3}>, <{2, 3}>,
<{1}, {1}>, <{1}, {2}>, <{1}, {3}>,
<{2}, {1}>, <{2}, {2}>, <{2}, {3}>,
<{3}, {1}>, <{3}, {2}>, <{3}, {3}>
STEP-2B:
Candidate Pruning: Prune candidate k-sequences that contain infrequent (k-1)-subsequences
After candidate pruning, the 2-sequences should remain the same:
<{1, 2}>, <{1, 3}>, <{2, 3}>,
<{1}, {1}>, <{1}, {2}>, <{1}, {3}>,
<{2}, {1}>, <{2}, {2}>, <{2}, {3}>,
<{3}, {1}>, <{3}, {2}>, <{3}, {3}>
STEP-2C and 2D: Support Counting and Candidate Elimination:

Candidate      Support
<{1, 2}>       0.6
<{1, 3}>       0.4
<{2, 3}>       0.6
<{1}, {1}>     0.2
<{1}, {2}>     0.6
<{1}, {3}>     0.8
<{2}, {1}>     0.2
<{2}, {2}>     0.2
<{2}, {3}>     0.6
<{3}, {1}>     0.2
<{3}, {2}>     0.0
<{3}, {3}>     0.2

After candidate elimination, the remaining frequent 2-sequences are:
<{1, 2}> (support=0.6),
<{2, 3}> (support=0.6),
<{1}, {2}> (support=0.6),
<{1}, {3}> (support=0.8),
<{2}, {3}> (support=0.6)
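
The support counting step can be sketched in Python as follows. A candidate is supported by a data sequence when its elements can be matched, in order, to elements of the sequence that contain them. The function names `contains` and `support` are chosen for this illustration; the database is the five-sequence table from this question.

```python
# Sequence database: each sequence is a list of itemsets (elements).
database = [
    [{1}, {2}, {3}],
    [{1, 2}, {3}],
    [{1}, {2, 3}],
    [{1, 2, 3}],
    [{1, 2}, {2, 3}, {1, 3}],
]

def contains(sequence, pattern):
    """True if `pattern` is a subsequence of `sequence` (greedy left-to-right match)."""
    pos = 0
    for element in sequence:
        if pos < len(pattern) and pattern[pos] <= element:
            pos += 1
    return pos == len(pattern)

def support(pattern):
    """Fraction of data sequences that contain the pattern."""
    return sum(contains(s, pattern) for s in database) / len(database)

candidates = [[{1, 2}], [{1, 3}], [{2, 3}], [{1}, {2}], [{1}, {3}], [{2}, {3}]]
for candidate in candidates:
    print(candidate, support(candidate))
```

Candidates whose support falls below the 50% minimum, such as <{1, 3}>, are eliminated before the next round of candidate generation.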


Repeat Step 2a: Candidate Generation


Generate 3-sequences from the remaining 2-sequences : <{1, 2}> , <{2, 3}>, <{1}, {2}>,
<{1}, {3}> , <{2}, {3}>
3-sequences are:
<{1, 2, 3}> (generated from <{1, 2}> and <{2, 3}>),
<{1, 2}, {3}> (generated from <{1, 2}> and <{2}, {3}>),
<{1}, {2}, {3}> (generated from <{1}, {2}> and <{2}, {3}>)
Repeat Step 2b: Candidate Pruning
2-sequences: <{1, 2}>, <{2, 3}>, <{1}, {2}>, <{1}, {3}>, <{2}, {3}>
3-sequences are: <{1, 2, 3}>, <{1, 2}, {3}>, <{1}, {2}, {3}>
<{1, 2, 3}> should be pruned because one of its 2-subsequences, <{1, 3}>, is not frequent.
<{1, 2}, {3}> should not be pruned because all of its 2-subsequences, <{1, 2}>, <{1}, {3}>
and <{2}, {3}>, are frequent.
<{1}, {2}, {3}> should not be pruned because all of its 2-subsequences, <{1}, {2}>, <{2},
{3}> and <{1}, {3}>, are frequent.
So after pruning, the remaining 3-sequences are: <{1, 2}, {3}> and <{1}, {2}, {3}>
Repeat Step 2c: Support Counting
3-sequences are: <{1, 2}, {3}>, <{1}, {2}, {3}>

Candidate          Support
<{1, 2}, {3}>      0.4
<{1}, {2}, {3}>    0.4

Both supports are below the 50% minimum, so there are no frequent 3-sequences left.


So the final frequent sequences are:
<{1}>, <{2}>, <{3}>, <{1, 2}>, <{2, 3}>, <{1}, {2}>, <{1}, {3}>, <{2}, {3}>


26. What is sequential pattern mining? How is it different from frequent item set
mining?
Sequential pattern mining is a topic of data mining concerned with finding statistically
relevant patterns between data examples where the values are delivered in a sequence. It is
usually presumed that the values are discrete, and thus time series mining is closely related,
but usually considered a different activity.

Frequent Itemset Mining:


27. Explain the self-joining technique to generate candidate sequential pattern in GSP
algorithm giving example.
