4 views

Uploaded by IJIRST

Finding interesting patterns from large amounts of data is an important task of data mining. There are many data mining tasks, such as classification, clustering, association rule mining, and sequential pattern mining. Sequential pattern mining extracts frequent subsequences from a sequence database which is helpful in making predictions, improving usability of systems, detect events and in making strategic product decisions in applications such as web user analysis, stock trend prediction, DNA sequence analysis. Subsequences can be contiguous and non-contiguous. Moreover the mining algorithm is classified into three categories: periodic patterns, statistically patterns, and approximate patterns. This paper discusses about few such pattern mining algorithms.

- Study Plan R
- Ruhul Sarker, Hussein a. Abbass, Charles Newton - Heuristic and Optimization for Knowledge Discovery1
- Web-based Data Mining Tools
- ARM_Survey.pdf
- FV-Lecture8
- p 025122127
- Novel Approch
- Data Mining For Fraud Detection
- IJAIEM-2014-12-31-97
- Data Mining Introduction
- i Jcc n 04112012
- 01
- 18
- Data Mining and Data Warehousing
- SAP Sales and Distribution
- 1intro.ppt
- 07 Fp Advanced
- Mine High Utility Itemset using UP-Tree and FP-Growth
- Discovering High Utility Itemsets Using Hybrid Approach
- Defining Multiple Replicats to Increase GoldenGate Performance

You are on page 1of 3

in Sequence Data Sets

Kirti Mirgal

PG Student

Department of Computer Engineering

Pillai Institute of Information Technology, Engineering, Media

Studies and Research, Panvel, India

Associate Professor

Department of Information Technology

Pillai Institute of Information Technology, Engineering, Media

Studies and Research, Panvel, India

Abstract

Finding interesting patterns from large amounts of data is an important task of data mining. There are many data mining tasks,

such as classification, clustering, association rule mining, and sequential pattern mining. Sequential pattern mining extracts

frequent subsequences from a sequence database which is helpful in making predictions, improving usability of systems, detect

events and in making strategic product decisions in applications such as web user analysis, stock trend prediction, DNA sequence

analysis. Subsequences can be contiguous and non-contiguous. Moreover the mining algorithm is classified into three categories:

periodic patterns, statistically patterns, and approximate patterns. This paper discusses about few such pattern mining algorithms.

Keywords: CloSpan, cSPADE, Data Mining, Random Projections, Sequential Pattern Mining, Subsequences

_______________________________________________________________________________________________________

I.

INTRODUCTION

Sequences are an important kind of pattern which occur frequently in many fields such as medical, business, financial, customer

behavior, educations, security, and other applications. Sequential pattern mining algorithms can be applied on various types of

data such as transaction databases, sequence databases, streams, strings, spatial data, graphs, etc. The analysis of the data in these

applications needs to be carried out in different ways to satisfy different application requirements, and it needs to be carried out

in an efficient manner [1].

Sequential pattern mining handles data in large sequential data sets. It has gained popularity in in marketing in retail industry,

biomedical research, DNA sequence patterns, financial industry. Most common applications are discovery of motifs in DNA

sequences, identifying interesting share price movements in financial industry, analysis of web log for web usage, customer

shopping sequences and the investigation of scientific or medical processes and so on.

Sequential pattern mining is find the relationships between occurrences of sequential events for looking for any specific order

of the occurrences. In the other words, sequential pattern mining aims at finding the frequently occurred sequences to analyse the

data or predict future data or mining periodical patterns [2][3]. It uses support as the criteria to evaluate frequency but is not

efficient to discover some patterns. The problem of mining sequential pattern can be partitioned into three categories: periodic

patterns, statistically patterns, and approximate patterns [4]. In this paper we will discuss a few approximate pattern mining

techniques such as CloSpan[5] , cSPADE [6], and Random projections[7].

II. CATEGORIES OF SEQUENTIAL PATTERN MINING

The problem of mining sequential pattern can be partitioned into three categories: periodic patterns, statistically patterns, and

approximate patterns [4].

Periodic Patterns

This model is not flexible and it is unable to find patterns whose occurrences are asynchronous [8]. Periodicity detection in time

series database is an important data mining problem and has a number of applications. For example, The gold prices increase

every weekend is a periodic pattern. As this model is restrictive it may fail to detect some interesting pattern if its occurrence is

misaligned due to noise events. A pattern can be partially represented to provide a more flexible model. For example, pattern

length four (I1, *, *) is a partial pattern indicating that the frist symbol must be I1. Behaviour of the system may change over time

and some patterns may not be present all the time.

Namely two parameters min-rep and max-dist, are used to specify the minimum number of occurrences required within each

subsequences and the maximum disturbance between any two successive subsequences. Given a sequence S, the parameters minrep and max-dist and the maximum period length Lmax, we can find the valid subsequences that have the most repetitions for each

valid pattern whose period length does not exceed Lmax in three phases. If parameters are not set properly, noise may be qualified

as a pattern. The following three phases outline algorithm for mining periodic patterns in brief [9].

46

(IJIRST/ Volume 3 / Issue 03/ 009)

The first phase: For each symbol I, the distance between any two occurrences of I are examined and then for each period l, the

set of symbols whose number of times are at least min-rep are sent to the next phase. Since there are a huge number of

candidates, a pruning method is needed to reduce it.

The second phase: In this phase, the single patterns (1-pattern) are generated. For each period l and each symbol I a candidate

pattern (I,*,*,,*) is formed that number of symbol * is (l-1).

The third phase: After discovering the single patterns in previous phase, i-patterns are generated from the set of valid (i-1)patterns and then these patterns are validated. In this phase, we can apply some heuristics. For example, it is obvious that if a

pattern is valid, then all of its generalizations are valid. Pattern (I1, I2*) is a generalization of pattern (I1,I2,I3).

Statistically Significant Patterns

For sequential patterns support and confidence are the two important measures. The support evaluates frequencies of the patterns

and the confidence evaluates frequencies of patterns in the case that sub-patterns are given. In some applications, the frequent

occurrence of a pattern may not be of significance importance and interesting as the few occurrences of an expected rare pattern.

This pattern called surprising pattern instead of frequent pattern. The information gain metric used in the information theory

field, may be useful to evaluate the degree of surprise of the pattern [10]. Target is finding set of patterns that have information

gain higher than minimum information gain threshold.

Given a pattern P=(I ,I ,,I ) and an information gain threshold min-gain, the task is to find all patterns whose information

1 2

gain in the sequence S exceed the min-gain value. There are some heuristics and methods that user can set the value of this

threshold.

Information gain of pattern P is defined as follows:

Info-gain(P)=Info(P)*Support(P)

(1)

Info(P)=Info(I )+ Info(I )++ Info(I )

(2)

1

(3)

Where prob(I ) is probability that symbol I occurs and |I| is number of events in S.

k

The main strategy to tackle the problem of mining statistically significant patterns is a recursive method which at the ith level

of recursion examines the patterns with i events. In some applications, users may want to find the k most surprising patterns. On

the other words, users may want to find top k patterns whose information gain is greater than a threshold. However, the major

limitation of information gain value is that it does not recognize location of the occurrences of the patterns.

Approximate Patterns

Noisy data are common properties of large real world databases. A random error or variance in a measured variable is noise. The

presence of noise can prevent the occurrence of a pattern and may not be recognized. moreover large patterns are more

vulnerable to distortion caused by noise so it is necessary to allow some flexibility in pattern matching. The previous models

only consider exact match of the pattern in the data. An approximate pattern is defined as a sequence of symbols which appears

more than a threshold under certain approximation types in a data sequence. To resolve the approximate pattern mining problem,

the concept of compatibility matrix is introduced [4]. This matrix provides a probabilistic connection from observed values to the

true values. Based on the compatibility matrix, real support of a pattern can be computed.

A new metric, namely match is defined to quantify the significance of a pattern. The combined effect of support and match

may need to scan the entire sequence database many times. Similar to other data mining methods, to tackle this problem

sampling based algorithms can be used. Consequently, the number of scans through the entire database is minimized.

Given a pattern P=(I ,I ,,I ) and a symbol sequence S=(I ,I ,,I ) where lslp, match of P in S (denoted by M(P,S)) is

1 2

lp

ls

defined as the maximal conditional probability P in every distinct subsequence of length lp in S. Eq. (4) show how compute

match provided that each observed symbol is generated independently.

M(P,S)=max M(P,s)

if ls>lp

sS

M(P,s)=prob(P|s)=

1ilp

C(I ,I )

i

if ls=lp

(4)

Mining of Closed Sequential Patterns

Ramin Afshar proposed a frequent closed subsequence mining approach CloSpan [5] that mines large sequences efficiently. This

algorithm produces number of efficiently search pruning techniques. The algorithm makes use of hash technique that has two

steps to carry out efficient optimization of the search space: 1) it create a superset of joint common sequences known as the LS

set, and keeps the set in prefix order and 2) then it performs post-pruning to eliminate non-closed sequences. It works in

following manner:

Classification is performed on each set of item and carries out the elimination of not frequent items and sequences that are

empty.

47

(IJIRST/ Volume 3 / Issue 03/ 009)

The Clospan method is recursively applied on the prefix search tree in depth first search manner and builds the prefix

sequence corresponding to it. Lastly, it removes the free sequences.

Then is uses a hash index on the projected database size and only the sequences whose projected database size is same as

that of current sequence are tested.

This algorithm performs well for exact pattern matching problems that is not suitable for approximate pattern matching

problems. As the dataset size increases, execution time of the algorithm also increases rapidly.

Sequence Mining in Domain Categories

Mohammed J. Zaki proposed cSPADE [6] algorithm for mining frequent sequences. It is an efficient algorithm based on a

number of syntactical limitations. They are size of the sequences, limiting the min or max gap on consecutive sequence elements,

to put a time slot on acceptable sequences and searching sequences that are predictive of one or multiple classes, even rare ones.

This algorithm methodically searches the sequence grid formed by the subsequence relation, from the simplest items to the

nearly particular frequent sequences in a depth-first (or breadth-first) manner. It requires preprocessing of data in a special

format, as it is based on syntactical constraints. For large database more time is needed for pre-processing the data as a result the

performance degrades.

Motifs Mining using Random Projections

J. Buhler has proposed an algorithm based on random projections [7] of input substrings. It carries out a number of tests on a

basic iterant. The Projection algorithm has two parts:

A random projection is selected and hashed with each l-mer x in the input sequences to its hash bucket with every test.

The required pattern is searched in a hash bucket that has adequate entries by applying an order of improving steps

This algorithm is more productive in searching required pattern in simulated data, but it needs improvement in time, in space

and maybe for future more complex biological data and real time.

IV. CONCLUSION AND FUTURE SCOPE

In this work, we provide a brief overview of models of sequential patterns in [4]. The paper theoretically shows the three types of

sequential pattern model and some properties of it. These models fall into three categories called periodic pattern, statistically

pattern, and approximate pattern. The first model is rigid but provides full periodicity and partial periodicity. In former, every

time point contributes to the cyclic behavior of a time series. In contrast, in partial periodicity, some time points contribute to the

cyclic behavior of a time series. Use of information gain as new metric helps to find surprising patterns which comparing with

the frequent patterns demonstrates the superiority of surprising patterns. The third model, approximate sequential pattern,

provides a means to verify noise. But still being and active research area in the data mining field. The ColSpan[5], cSPADE[6],

Random Projections[7] are approximate pattern mining techniques which allows to find the approximate patterns.

REFERENCES

[1]

[2]

Dong G., and Pei J., Sequence Data Mining, Springer, 2007.

Agrawal R., and Srikant R., "Fast Algorithms for Mining Association Rules", 20th Int. Conf. Very Large Data Bases, VLDB, Morgan Kaufmann, 1994,

1994, pp. 487-499.

[3] Agrawal R., and Srikant R., "Mining Sequential Patterns", 11th Int. Conf. on Data Engineering, IEEE Computer Society Press, Taiwan, 1995, pp. 3-14.

[4] Wang W., and Yang J., Mining Sequential Patterns from Large Data Sets, Springer, 2005.

[5] X. Yan, J. Han, and R. Afshar, CloSpan: Mining Closed Sequential Patterns in Large Datasets, Proc. SIAM Intl Conf. Data Mining (SDM), 2003.

[6] M.J. Zaki, Sequence Mining in Categorical Domains: Incorporating Constrains, Proc. Ninth Intl Conf. Information and Knowledge Management

(CIKM), pp. 442-429, 2000.

[7] J. Buhler and M. Tompa, Finding Motifs Using Random Projections, J. Computational Biology, vol. 9, no. 2, pp. 225-242, 2002.

[8] Zhao Q., and Bhowmick S. S., Sequential Pattern Mining: A Survey, Technical Report, CAIS, Nanyang Technological University, No. 2003118,

Singapore, 2003.

[9] Wang W., and Yang J., Mining Sequential Patterns from Large Data Sets, Springer, 2005.

[10] Yang J., Wang W., Yu P. S., and Han J., Infominet: Mining Surprising periodic Patterns, In Proceeding of the 7th ACM Int. Conf. on Knowledge

Discover and Data Mining (KDD), 2001, pp. 395-400.

48

- Study Plan RUploaded byMochammad Adji Firmansyah
- Ruhul Sarker, Hussein a. Abbass, Charles Newton - Heuristic and Optimization for Knowledge Discovery1Uploaded byvladislav
- Web-based Data Mining ToolsUploaded byLewis Torres
- ARM_Survey.pdfUploaded bywaqas7136
- FV-Lecture8Uploaded byFaizur Rahman Mohammed
- p 025122127Uploaded byAJER JOURNAL
- Novel ApprochUploaded byJeena Mathew
- Data Mining For Fraud DetectionUploaded byIJMER
- IJAIEM-2014-12-31-97Uploaded byAnonymous vQrJlEN
- Data Mining IntroductionUploaded byKrishna Kumar Kushwah
- i Jcc n 04112012Uploaded byeditor3854
- 01Uploaded byAnonymous 7wk2G5BX
- 18Uploaded byapi-3814854
- Data Mining and Data WarehousingUploaded byGrishma Bobhate
- SAP Sales and DistributionUploaded byANUJ SHARMA
- 1intro.pptUploaded byAkanksha Singh
- 07 Fp AdvancedUploaded byamreshkr19
- Mine High Utility Itemset using UP-Tree and FP-GrowthUploaded byseventhsensegroup
- Discovering High Utility Itemsets Using Hybrid ApproachUploaded byEditor IJRITCC
- Defining Multiple Replicats to Increase GoldenGate PerformanceUploaded bynt29
- Sellingvs.marketingUploaded bydebnath_ban
- 00284___aceb3d8922ab6abdcb23dea049e9b212Uploaded byTiu Ton
- NGDM_Lawrence O Hall and Kevin W BowyerUploaded byapi-3798592
- DWDM_LABUploaded byswetha adari
- Apl i Kasi Data MiningUploaded byJack Code
- bifinal-130825050713-phpapp02Uploaded byksachchu
- A Dynamic Approach for Discovering Maximal Frequent ItemsetsUploaded bytasmia_172
- triumph and tragedy lessonsUploaded byapi-394725139
- ROCUploaded bySedat Ister
- Using Abstract ClassUploaded byAnkur Kumar

- The Effect of Diverse Recording Devices on Forensic Speaker Apperception SystemUploaded byIJIRST
- Postprocessing of Compacted Images through Consecutive DenoisingUploaded byIJIRST
- Experimental Analysis of Friction Stir Processing of Tig Welded Aluminium Alloy 6061Uploaded byIJIRST
- Stress Analysis of 220cc Engine Connecting RodUploaded byIJIRST
- Patterns of Crop Concentration, Crop Diversification and Crop Combination in Thiruchirappalli District, Tamil NaduUploaded byIJIRST
- Development of Satellite Data for Infrastructure Updation and Land Use/Land Cover Mapping - A Case Study from Kashipur & Chhatna Block, Bankura & Purulia District, West BengalUploaded byIJIRST
- Vibration Analysis of Composite Leaf Spring by Finite Element MethodUploaded byIJIRST
- Physico-Chemical Analysis of Selected Ground Water Samples in and Around Nagapattinam District, TamilnaduUploaded byIJIRST
- Multi-Physics based Simulations of a Shock Absorber Sub-SystemUploaded byIJIRST
- Manganese: Affecting Our Environment (Water, Soil and Vegetables)Uploaded byIJIRST
- Satellite Dish Positioning SystemUploaded byIJIRST
- Arduino-UNO Based Magnetic Field Strength MeasurementUploaded byIJIRST
- Custom ROMUploaded byIJIRST
- Currency Recognition Blind Walking StickUploaded byIJIRST
- Study on Performance Evaluation of Forced Convection Solar Dryer for Turmeric (Curcuma Longa L.)Uploaded byIJIRST
- Experimental Investigation On Concrete by Replacement of Sand by Silica Sand and Artificial SandUploaded byIJIRST
- Induction Motor Drive using SPWM fed Five Level NPC Inverter for Electric Vehicle ApplicationUploaded byIJIRST
- Development of Tourism Near Loktak Lake (Moirang) in Manipur Using Geographical Information and Management TechniquesUploaded byIJIRST
- Performance Analysis of Organic Rankine Cycle (ORC) Working on Different Refrigerant Fluids having Low Boiling PointUploaded byIJIRST
- Experimental Investigation on the Effect of Use of Bottom Ash as a Replacement of Fine AggregatesUploaded byIJIRST
- Reconfigurable Manufacturing Systems using the Analytical Hierarchical Process (AHP) - A ReviewUploaded byIJIRST
- Efficient Revocation of Data Access in Cloud Storage Based on ABE-SchemeUploaded byIJIRST
- Impact of Different Soils and Seismic Zones on Varying Height of Framed StructuresUploaded byIJIRST
- Rock Deformation by Extesometers for Underground Powerhouse of Sardar Sarovar Project (Gujarat)Uploaded byIJIRST
- Intelligent Irrigation SystemUploaded byIJIRST
- Analysis of Agent Oriented Software EngineeringUploaded byIJIRST
- Comparative Study of Inner Core, Peripheral and RC Shear Wall SystemUploaded byIJIRST
- Literature Review for Designing of Portable CNC MachineUploaded byIJIRST
- Women Protection Mechanism with Emergency Communication using Hand Waving PatternUploaded byIJIRST
- Infiltration, Permeability, Liquid Limit and Plastic Limit of SoilUploaded byIJIRST

- cholla academy ela grade 9 curriculum map writing and researchUploaded byapi-320980022
- Study Regarding the Effects of Brain Gym on Student LearningUploaded bymuthoharoh
- ResearchMethodology_Week01Uploaded bysabihuddin
- Paradigms.pdfUploaded byDodi Cahyadi
- Public Policy Advocacy for Social ChangeUploaded byValentina Petrescu Anichitei
- 19_Topical Anesthetics for Dermal InstrumentationUploaded byMarta Maj
- Factors Influencing Usage of Internet for Academic Purposes Using Technology Acceptance ModelUploaded byJyothi Mallya
- How to Find and Fund a PhdUploaded byJhon
- alternativedesignsandmethodsforcsm_9011Uploaded byTien Pham
- Ernest Kurtz, Research on Alcoholics AnonymousUploaded byCarey Pickard
- Leading by Leveraging Culture CMR 2003Uploaded byRohit Kumbhar
- ThesisUploaded byOnaj Orued Acihcal
- Rev 1Uploaded byruchitasomani
- Digital RevolutionUploaded byFlávia Pessoa
- FINALAbstractBook.pdfUploaded byTamu Le
- Comparison and Contrast EssayUploaded byBrian
- Factors Affecting EFL College Students’ Reading ComprehensionUploaded byAhmad Mujib
- Vol_17_2Uploaded bymayamasergio2908
- Fraud Detection, Redress and Reporting by AuditorsUploaded bykadekayu
- Multi GradeUploaded byAngelica Santos
- Abs List v4 by Subject Area and With IntroUploaded bystaimouk
- Marko a. Rodriguez - A Methodology for Studying Various Interpretations of the N,N-Dimethyltryptamine-Induced Alternate RealityUploaded byscribd_enscribed
- Design and Implementation of a Computerized Educational Administrative Information SystemUploaded byYekeen Abiodun
- Ann Campbell Keller - Science in Environmental PolicyUploaded byFebri Priyoyudanto
- Training Text MaterialUploaded bySubbu Suresh
- PharmacyUploaded bybenageddon
- jcq1.pdfUploaded byDoctor Rida
- Graphic OrganizerUploaded byapi-3731687
- cap 2 final paperUploaded byapi-413271136
- climate-change.pdfUploaded byTarun Mattaparthy