SlideShare a Scribd company logo
International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.5, October 2012
DOI : 10.5121/ijcsea.2012.2506 61
MMP-TREE FOR SEQUENTIAL PATTERN
MINING WITH MULTIPLE MINIMUM
SUPPORTS IN PROGRESSIVE DATABASES
K.M.V.Madan Kumar1
, P.V.S.Srinivas2
and C.Raghavendra Rao3
1
Research Scholar, CMJ University, Shillong, Meghalaya, India
Assoc Professor, TKR College of Engg &Tech, Hyderabad, AP, India
2
Professor, Geethanjali College of Engineering & Technology, Hyderabad, India
3
Professor, Dept. of CIS, University of Hyderabad, Hyderabad, India
ABSTRACT
Although there may be lot of research work done on sequential pattern mining in static, incremental,
progressive databases, the previous work do not fully concentrating on support issues. Most of the
previous approaches set a single minimum support threshold for all the items or item sets. But in real
world applications different items may have different support threshold to describe whether a given
item or item set is a frequent item set. This means each item will contain its own support threshold
depends upon various issues like cost of item, environmental factors etc. In this work we proposed a
new approach which can be applied on any algorithm independent of that whether the particular
algorithm may or may not use the process of generating the candidate sets for identifying the frequent
item sets. The proposed algorithm will use the concept of “percentage of participation” instead of
occurrence frequency for every possible combination of items or item sets. The concept of percentage
of participation will be calculated based on the minimum support threshold for each item set. Our
algorithm MS-PISA , which stands for Multiple Support Progressive mIning Of Sequential pAtterns , which
discovers sequential patterns in by considering different multiple minimum support threshold values for
every possible combinations of item or item sets.
KEYWORDS
Multiple minimum supports, Progressive sequential pattern, Percentage of participation
1. INTRODUCTION
Sequential Pattern mining is one of the most important research issues in data mining. Which
was first introduced by Agarwal and Srikanth [2] and can be described as follows: we are given a
set of data sequences which will be used as input data. Each sequence consists of a list of ordered
item sets containing a set of different items. The sequential pattern mining finds all Sub
sequences with frequencies lower than min support which is given by user. After that there have
been a significant research work was done by various researchers on sequential pattern mining
and defined number of algorithms not only for static data base [2],[3] Where data do not change
over time but also for incremental data bases[6] where there will be new data arriving as the time
goes by. In addition to above researchers some other researchers derived other kind of sequential
patterns like closed sequential patterns. Constraint sequential pattern, maximal sequential
patterns, spatio temporal sequential patterns. Sequential pattern on specific type of data, on
stream data.
International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.5, October 2012
62
The sequential patterns mining in progressive data bases will be having a vide use in many
applications. For example in mobile applications, share market. Moreover the stock price
changes of the company before five years ago may have very little influence on the quotes of
other stocks. To remedy the problem of sequential pattern mining in progressive databases,
Haung.et.al [6] proposed an algorithm PISA which takes the concept of period of Interest (POI)
but Haung.et.al[6] also considered only uniform min.sup like all the previous researchers.
But considering uniform min.sup, implicitly assumes that all items in the data base have similar
frequency. However some items may appear very frequently in the data base while other rarely
appears. Under such circumstances, if we set the value of min support too high we will not find
those rules involving rare items in the data base. On the other hand if we set that value too low, it
will generate huge amounts of meaningless patterns. Therefore liue .et.al [4] first address this and
propose the concept of multiple minimum supports (MMS in Short)in association rule mining. In
this study we first extend the definition of sequential pattern by considering the concept of MMS,
which allows users to specify multiple min.sup for each item. Items with low frequencies will be
specified with lower minimum support and pattern involving these items can be easily retrieved
for further decision support.
But the major problem on frequent pattern mining with MMS is that the downward closure
property no longer holds in the mining process, which means that a super-pattern of an infrequent
pattern might be frequent pattern. To effectively reduce the search space in a level –wise
methods, Liu.et.al proposes the sorted closure property, where all item in data sets are sorted in
ascending order by their MIS values. The sorted closure property however is invalid in sequential
pattern mining since the order in the data sequences cannot be altered. Therefore to discover
complete set of sequential patterns with MMS is not straight forward. Based on the new
definitions of sequential patterns with MMS, we first proposed the concept called Percentage of
participation (POP) where POP is the percentage of participation of the item or item set with
respect to the no of sequences in the POI i.e. |Db| and minimum item support (MIS).
The structure of this paper as follows. In section 2, we first review the concept of PISA as the
basis of our approach. Section 3 introduces the definition of the progressive sequential pattern
mining with multiple minimum supports (MMS) by involving the Percentage of participation
(POP).Our algorithm called MS-PISA will be discussed in section 4 and we have conclusion in
section 5.
2. RELATED WORK
Haung.et.al proposed algorithm which works for mining of sequential patterns in progressive data
bases. The input of progressive sequential pattern mining is a user specified length of POI and
user-defined minimum support threshold. POI is a sliding window protocol whose length is a user
specified time interval continuously advances as the time advances. The sequences having
elements whose time stamps are within the POI will be considered to contribute for |Db| for
current sequential patterns (where |Db| is the number of sequences in the POI). The sequences
having elements with time stamps older than the POI will be considered as absolute and removed.
In this they consider an element which is set of items appeared in a sequence with the time stamp
denoted by p, q. When the sequence database is denoted by Db then a sequence in a sequence
database is a set of elements ordered by time stamp increasingly. They also introduced a time
interval denoted by [p, q] to represent the time interval between time stamp p and q. Therefore
Dbp,q is defined as a subset of the databases Db containing the elements of sequences from time
stamp p to time stamp q. They also aimed finding the sequential patterns containing no repeated
elements and having more than one element. In the beginning, the mining algorithm receives new
data of different timestamps with a user-defined minimum support threshold and the length of
International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.5, October 2012
63
POI. For each POI, the mining algorithm progressively updates the sequences in the database.
Then, the algorithm outputs frequent sequential patterns which are qualified by the uniform
minimum support threshold. Note that the mining algorithm should prune away obsolete data
from the sequence database, delete obsolete sequential patterns rapidly, and update |Db| with
the number of sequences which have elements in the urrent POI. The process will be continuously
executed until there is no more newly arriving data. To solve the progressive sequential pattern
mining algorithm, Haung.et.al proposed a progressive mining algorithm PISA. Where PISA
maintains a PS-tree to keep the information of the progressive database and up-to-date
sequential patterns in each POI. The nodes in PS-tree can be divided into two different types.
They are root node and common nodes. Root node is the root of PS-tree containing nothing but a
list of common nodes as its children. Each common node stores two information, say node label
and a sequence list, as shown in the top-left side of Fig. 2. The label is the same as the element in
a sequence. The sequence list stores a list of sequence IDs to represent the sequences containing
this element. Each sequence ID in the sequence list is marked by a corresponding timestamp.
Whenever there are a series of elements appearing in the same sequence, there will be a series of
nodes labeled by each element, respectively, with the same sequence IDs in their sequence lists.
Then, the first node will be connected to the root node and the second node representing the
following element will be connected to the first node. The other nodes will be connected
analogously. Note that, in such a way, the path from root node to any other node will represent
the candidate sequential pattern appearing in this sequence. The appearing timestamp for each
candidate sequential pattern will be marked in the node labeled by the last element. If there is
another sequence having the same pattern, the sequence ID will be inserted into the sequence lists
of the same nodes labeled by these elements on the path. On the other hand, if an element
appearing in a sequence is obsolete, the corresponding sequence ID will be removed from the
sequence list of the node. In addition, if a node has no sequence in the sequence list, it will be
pruned away from PS-tree.
Figure 1
For the above example shown in figure1 the algorithm PISA constructs the candidate sets as
follows which is shown in figure2.
International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.5, October 2012
64
Figure 2
Before this many researchers have developed various methods to find frequent sequential patterns
with a static database. Apriori All [2], GSP [3] are the milestones of sequential pattern mining
algorithms based on traditional association rules mining technique. SPADE [5], illustrated by
Zaki, systematically searched the sequence lattice spanned by the subsequence relation. Han et
al. and Pie et al. brought up Free Span and Prefix Span, which found sequential patterns by
constructing sub databases of the entire database. Zhang et al. proposed two algorithms
GSP+ and MFS+ based on static algorithms GSP and MFS (also derived by the
same authors).But as discussed early in section1 all the researchers considered uniform
min.sup to describe whether the sequence is frequent or not.
International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.5, October 2012
65
3. PROBLEM DEFINITION
In this section, we formally give definitions used in the discovery of sequential pattern with
MMS. Let I denote the set of items in the database, and a subset of I is called an item set. A
customer’s data-sequence is an ordered list of items with time stamps. Therefore, a sequence, say
α, can be represented as <(a1: t1), (a2: t2), (a3: t3), …, (an: tn)>, where a j is an item, and tj stands
for the time when aj occurs, 1≤ j≤n, and tj-1≤tj for 2≤j≤ n.
If several items occur at the same time in tj for 2 j the sequence, they are ordered alphabetically.
Definition 1. Given an item set Iq = (i1, i2...im), we say item set Iq occurs in α if integers 1≤k1 <
k2 <…< km ≤n exist such that, i1 = ak1, i2 = ak2,... ,im = akm and tk1= tk2=...= tkm. We refer to k1 and
tk1 jointly as the position and the time that Iq occurs in α, respectively.
Definition 2. Let β = <I1, I2...Is> (Iq subset or equal to I for 1 ≤q≤ s) be a sequence of item set.
Assume that each Iq in β occurs in α. Then we say sequence β occurs in α, or is a subsequence of α
if tI1 < tI2<...< tIs, where tIq (1≤ q ≤s) is the time, at which Iq occurs in α.
Definition 3. A sequence database S is formed by a set of records <sid, s >, where sid is the
identifier of this data-sequence and s is a data-sequence. For a given sequence β, the support
count of sequence β in S are defined as follows:
supp (β) = |{(sid, s)| (sid, s) ЄS ^β is a subsequence in s}|
The following definitions are related to the concept of MMS. In this model, the definition of the
minimum support is changed. Each item in the database can have its minsup, which is expressed
in terms of minimum item support (MIS). In other words, users can specify different MIS values
for different items.
Definition 4: Let MIS (i) denote the MIS value of item i (i subset or equal to I). Given an item
set Iq = (i1,i2,...,im), the MIS value of item set Iq = (i1,i2,...,im) (1≤ k ≤m), denoted as MIS(Iq), is
equal to:
min [MIS (i1), MIS(i2),…, MIS(im)]
Definition 5: Given a sequence β = <I1.I2...Is>
(Iq subset or equal to I for 1≤ q≤ s), the minimum support threshold of β, denoted as MIS (β), is
equal to:
min [ MIS(I1), MIS(I2),…, MIS(Is)]
Definition 6: Given a sequence database S and a sequence β, we call β is frequent in S or β is a
sequential pattern in S if supp (β) ≥MIS (β).
Definition 7: Progressive sequential pattern mining problem. “Given a user-specified
length of POI and a user- defined minimum support threshold, find the complete set of
frequent subsequences whose occurrence frequencies are greater than or equal to the minimum
support times the number of sequences in the recent POI of a progressive database.”
International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.5, October 2012
66
Definition 8: Let I be an item or item set and MIS (I) be the minimum item support threshold of item I. |Dbp,q
| is
the number of sequences In the POI. Timestamps between p ,q.
Percentage Of Participation(POP)= 100/ |Dbp,q
|* MIS(I)
With the provision of different minimum item support thresholds for different items, user can
effectively express different support requirements for different data-sequences. MMS allow users
to have higher minimum supports for sequences involving high frequent items, and also allows
us to have lower minimum supports for sequences involving rare items. Given a sequence
Database S and a set of MIS values for all items in S, we discover all sequential patterns that
satisfy MIS (β).
4. THE PROPOSED ALGORITHM
Progressive Mining of sequential patterns with multiple minimum supports
For mining of sequential patterns in progressive databases with multiple minimum supports, we
propose another algorithm called MS-PISA. The algorithm maintains a tree based data structure
called MMP-tree which will store the knowledge of the database, where the database is
progressively incremented as the new transactions has been taken place. The tree also maintains
the percentage of participation of each element whenever the element is formed. Here we will
first discuss about MMP-tree in section 4.1.Next we explained in detail about MS-PISA in section
4.2.In section 4.3 we have taken an example database and discussed very clearly how the
algorithm MS-PISA works on the example.
4.1 MMP-tree
Here the MMP-tree will act as a basic kernel of MS-PISA. It not only stores the knowledge of all
the sequences in the database which progressively updates and also helps the algorithm to gather
frequent sequential patterns according to the multiple minimum supports in each POI. The nodes
in MMP-tree can be categorized in to two different categories. One is the root node and another is
common node. Root node is root of the MMP tree which does not stores any other information
but attached to common nodes as its children. Each common node gathers the knowledge about
three issues, one node label, second the sequence list of itemset, third the percentage of
participation of the itemset according to the minimum support of itemset.
Here the label is the same as the element in a sequence. The sequence list is the ID of the
sequence that contains the element to represent the sequence. The sequence ID in the sequence
list will be attached to the corresponding time stamp to help in deleting the obsolete data. It is
worth to mention that only the nodes in the first and the second levels have to maintain the
corresponding time stamps for the sequence IDs. The time stamps for the sequence IDs in the
nodes below the third level are the same as the time stamps for the same sequence IDs in the
second level. Therefore it is not necessary to store the duplicate information. Third along with
label, sequence ID the MMP-tree also stores the information about the percentage of participation
of the element according to the minimum support of the element. For example if the element is
AB then the tree finds out which item in the element is having less user defined minimum
support. If A is having less minimum support, then the percentage of participation will be
calculated with respect to A by using the formula POP=100/|Dbp,q
| * MIS(A) where |Dbp,q
| is no
of sequences in the time slot from time stamp p to time stamp q.
When ever in a particular sequence, if the element is appearing in a series wise then series of
nodes labeled by each element along with the percentage of participation will be gathered
respectively. The sequence ID will be same the sequence list. In the process of forming of the
International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.5, October 2012
67
MMP tree the first node will be connected to the root node with the element label and percentage
of participation of the element with respect to minimum support of the element. The second node
will be representing the following element along with the percentage of participation of that
particular element with respect to the minimum support of that element. The second node will be
attached to the first node which is already formed. The other nodes will be formed with
corresponding elements and their percentage of participation analogously. Note that in such a
way the path from root node to any other node will be representing the candidate sequential
pattern appearing in this sequence. The appearing time stamp for each candidate sequential
pattern will be marked in the node labeled by last element. If there is another sequence having
the same pattern, the sequence ID will be inserted into the sequence lists of the same nodes
labeled by these elements on the path along with the percentage of participation of the element.
On the other hand, if an element appearing in a sequence is obsolete, the corresponding sequence
ID will be removed from the sequence list of the node. When an obsolete node is removed the
percentage of participation of the element also will be removed. In addition, if a node has no
sequence in the sequence list, it will be pruned way from MMP tree. Thus we can ensure that
there are only up-to-date candidate sequential patterns in MMP tree. Now we will consider those
elements whose percentage of participation in more than 100% in total as frequent sequential
patterns from the above formed candidate sequential patterns, now we will demonstrate how to
maintain MMP-Tree in detail in the following section i.e. 1.2.
4.2 Algorithm MS-PISA
The main process in MS-PISA is to progressively update the knowledge of each sequence and
accumulate the percentage of participation of each element in the sequence. To construct the
candidate sequential pattern in progressive data base, the MS-PISA will use the MMP-Tree. For
this, the MMP-Tree mainly stores all the sequences along with the percentage of participation
corresponding to the minimum support of element that are falling in the sequences from one POI
to another. When receiving the elements at the arriving time stamp, say t+1, MS-PISA constructs
the original MMP-Tree of time stamp t in post order (children first, then the node itself) by
accumulating the POP and updates the MMP-Tree of time stamp t.
In the process of MMP-Tree, MS-PISA do the following things:
1. Deletes the obsolete elements along with the corresponding percentage of participation from.
2. Update the current sequences with the POP.
3. Calculate the POP of the new arriving elements and inserts them in to MMP-Tree of time
stamp t.
The detailed algorithm for POP is shown in the figure 3(a) and for MS-PISA in the figure 3(b),
3(c). In figure 3(b) from line 4 to line 7 MS-PISA collects the elements of all sequences along
with the POP of the elements at the running time stamps and constructs the MMP-Tree. Then
MS-PISA travels in forward direction to the next time stamp and collets the arriving elements in
the sequences along with the POP of the elements. This process continuous until there is no
further newly arriving elements in a progressive data base.
The main procedure constructs is used to construct the MMP-Tree of time stamp t and convert it
to the new MMP-Tree of time stamp t+1 along with the POP of the elements. The procedure
construct is shown in the fig. 3(c) and it was explained in detail with an example in fig.4 the
procedure construct MMP-Tree is in post order. That is MS-PISA goes to the children node and
deals with them first. After all the children nodes are constructed there MS-PISA constructs to
the node itself. The basic concept of procedure construct is to attach each newly arriving element
of all the sequences along with percentage of participation of the each arriving element with
respect to the minimum support of the element in to the MMP-Tree. There are two kinds of
International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.5, October 2012
68
nodes in MMP-Tree say root node and children nodes. For root node MS-PISA examines all
sequences which have new elements and calculate the percentage of participation of the element
with the formula 100 / |Dbp,q
|*MIS(el) where el :element. As shown in figure 4(t1), the sequences
S01, S02, S03 consisting of elements A,(AD) and A. The MS-PISA deals with root node first
from line 2 to line 11 shown in figure 3(c). If the node is root MS-PISA examines all newly
arriving elements in the eleSet including the POP from line 3 to line 11.MS-PISA has to create all
the combinations of candidate elements and calculate the POP of the combinations in newly
arriving data. For example if the arriving data contains (ABC), then the combinations will be A,
B, C, (AB), (AC), (BC) and (ABC) with their corresponding POP. For each element, if the
element already appeared in the last time stamp then MS-PISA checks for the sequence ID was
existed in that child node from Line 6 to 9.If it is existed in the list then MS-PISA updates the
time stamp of that sequence to new time stamp and calculate the POP of the element by
considering the MIS (el).Otherwise MS-PISA creates a new sequence with POP of the elements
and inserts the sequence in to the list of the sequence child. If there is no child node is available
with the same label, MS-PISA creates new child node with the corresponding sequence ID and
calculate the POP of the elements which are present in the node in line 11.
As shown in figure 4(t2) at node labeled by A ,MS-PISA checks the list of sequences of that node
to find if there is any newly arriving element of the sequences ,say 01,02 and 03.The
corresponding details are shown in figure 3(c) from line 12 to 27.If the upcoming node is not the
root node , MS-PISA first deletes the sequences as obsolete from the list of sequences along with
their corresponding POP from line 14 to 15.If there is no ID left in the list MS-PISA prunes way
the node along with the POP from parent and goes to next node from line 26 to 27.The MS-PISA
the left out sequences in the sequence list from line 16 to line 25 of figure 3(c). If there is a newly
arriving element in eleset, for all combinations of candidate elements in the arriving data MS-
PISA checks whether the element is on the path from root node in line 18.If it is not, that means
that there is new candidate sequential pattern. Then MS-PISA follows the procedure from line 19
to 25 and creates new sequences or Childs along with the percentage of participation of elements
same as from line 5 to 11.From line 26 to 27 we can note that if there is no sequence in a node
that node and it’s children along with POP should be deleted.
After processing all the common nodes (which means the process of constructing candidate
sequential patterns was completed) then MS-PISA examines the sequence ID’s for frequent
sequential patterns by considering the total POP of the nodes (including from root to child)
should be equal to 100 or more than 100.Now MS-PISA will give the path from root to child as
the output from line 28 to 29 of figure 3(c).In this way MS-PISA can find out all up-to-date
frequent sequential patterns of time stamp t+1after constructing the original MMP-tree of time
stamp t.
The process of calculating the POP will be as follows. As shown in figure 3(a) the algorithm POP
will check the elements from newly arriving data. If the element is having more than one item
then it will identify the item which is having less MIS value as shown from line 3 to 4 in figure
3(a). Then it will calculate the corresponding POP of the element described from line 6 to8 of
figure 3(a).
Consider the example shown in the figure 1 and the MIS values of A, B, C and D are 0.8, 0.5, 0.3
and 0.2 consecutively. The POP of A, B, C and D for 5 sequences are 25, 40, 66, and 100. The
figures 4 (a), 4(b), 4(c), 4(d), 4(e), 4(f), 4(g), 4(h) and 4(i) are the MMP-Trees of the example
sequence database shown in figure 1. In the same figures we have shown the frequent sequences.
The process was explained in detail in the next section.
International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.5, October 2012
69
Algorithm POP (Items, MIS(I))
1. var POP;
2. var I, Imin;// Items
3. For (all combinations of items in the ele);
4. If (No of items >1);
5. Check and identify the item Imin having less support;
6. POP (ele) =100/|Dbp,q
|*MIS(Imin): // |Dbp,q
| is no of sequences from p to q
7. else
8. POP (ele)=100/|Dbp,q
|*MIS(I): // |Dbp,q
| is no of sequences from p to q
END
Figure 3(a)
Algorithm MS-Pisa (POI)
1. var PS; // PS-Tree
2. var current Time; // timestamp now
3. var eleSet; // used to store elements ele
4. while (there is still new transaction)
5. eleSet = read all ele at current Time;
6. construct (current Time, PT);
7. current Time++
END
Figure 3(b)
Algorithm construct (current time, PS)
1. for (each node of PS in post order)
2. if (node is null)
3. for (ele of every seq in eleSet)
4. for (all combination of elements in the ele)
5. if (element==label of one of ele.child)
6. if (seq is in node.child.seq_list)
7. update timestamp of seq to currentTime, POP (ele);
8. else
9. create a new sequence with current Time, POP(ele);
10. else
11. create a new child with element, current Time, POP(ele);
12. else for (every seq in the seq_list)
13. if (seq.timestamp<= currentTime_POI)
14. delete seq, POP (el) from seq_list and continue to next seq;
15. if (there is new ele of the seq in eleSet)
16. for (all combination of elements in the ele)
18. if (element is not in the path from Root)
19. if (element==label of one of node.child)
20. if (seq is in child.seq_list)
21. child.seq_list.seq.timestamp= seq.timestamp;
22. else
23. create a new sequence with seq.timestamp, POP (ele);
24. else
25. create a new child with element, seq.timestamp, POP (ele);
28. if (seq_list.Total POP (el)>=100%)
29. output the ele as a sequential pattern;
END
Figure 3(c)
International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.5, October 2012
70
Figure 4(a)
Figure 4(b)
Figure 4(c)
International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.5, October 2012
71
5.RESULT
Our proposed algorithm is very effectively working for test dataset and we have
analyzed the test dataset for different parameters like period of interest, execution time etc.
5.1 Impact of period of interest (POI) on execution time
Execution time is the time required to execute all the instructions in the proposed algorithm. Here
we can observe that the execution time will increase a little bit with respect to the time taken for
identifying the item having minimum MIS value. For this we applied the quick sorting technique
.It can be noted that execution time is directly proportional to POI. The reason is that, the increase
in POI required more time for process as the no of new elements will be added to the existing
one.
Figure 5(a) shows the impact of POI over the execution time.
Figure 5(a)
5.2 Impact of POI over number of patterns
Number of patterns is dependent on period of interest a. As from F i g . 5 ( b ) we can see
that as the period of interest increases, the number of patterns also increases. This is because
as the period of interest increases, the algorithm has more items to process and so they give
more number of patterns.
Figure 5(b)
0
500
1000
1500
2000
2500
1 2 3 4 5 6 7 8
Timetocalculatethe
patterns(msec)
Time Vs Period Of Interest
time
0
20
40
60
80
100
1 2 3 4 5 6 7 8
NoOfPatterns
No Of Patterns Vs POI
No Of Patterns Vs POI
International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.5, October 2012
72
4.3 Comparison between MS-PISA and PISA in terms of time in progressive database
Generally mining frequent sequential patterns with multiple minimum supports will take more
time when compared with mining frequent sequences with uniform support. In case of percentage
of participation (POP), the time for execution of Ms-PISA will take little bit more time than the
PISA. The difference of time between these two will be the time for identifying the item having
minimum MIS value from the newly arriving element and accumulating the POP. We can
observe the difference in the Figure5(c)
Figure 5(c)
6. CONCLUSION
In the proposed novel work MS PISA algorithm is efficiently working by using the concept of
percentage of participation (POP) for mining frequent sequences in progressive databases with
multiple minimum supports. Even though the algorithm is taking little more time for execution
when compared with PISA (which finds the frequent sequential patterns in progressive databases
by considering uniform min.support for different items) it works on frequent sequences with
multiple minimum supports for different items. As the novel algorithm is finding the sequences
with multiple minimum supports we can ignore the little increase in time of execution. The
algorithm MS-PISA is very time efficient in that MS-PISA needs only one scan of the candidate
structure mentioned by MMP-Tree at each time stamp rather quadratic scans. In addition with the
novel design of MMP-Tree , MS-Pisa takes less space.
0
500
1000
1500
2000
2500
3000
1 2 3 4 5 6 7 8
TimeToCalculatePatterns(msec)
Time Variation for Ms-PISA vs PISA
Time for MS-PISA
Time PISA
International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.5, October 2012
73
7. REFERENCES
[1] R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association Rules,” Proc. 20th Int’l Conf.
Very Large Data Bases (VLDB ’94), pp. 478-499, Sept. 1994.
[2] R. Agrawal and R. Srikant, “Mining Sequential Patterns,” Proc. 11th Int’l Conf. Data Eng. (ICDE
’95), pp. 3-14, Feb. 1995.
[3] R. Srikant and R. Agrawal, “Mining Sequential Patterns: Generalizations and Performance
Improvements,” Proc. Fifth Int’l Conf. Extending Database Technology (EDBT ’96), Mar. 1996.
[4] B.Liu, W.Hsu and Y.Ma, “Mining association rules with multiple minimum supports”, Proceedings of
the fifth ACM, SIGKDD conference Sandiego,CA,USA,August 15-18,1999,P.341
[5] M.J. Zaki, “Efficient Enumeration of Frequent Sequences,” Proc. Seventh ACM Int’l Conf.
Information and Knowledge Management (CIKM ’98), Nov. 1998.
[6] Jen-Wei Huang, Chi-Yao Tseng, Jian-Chih Ou, and Ming-Syan Chen, Fellow, IEEE: A General
Model for Sequential Pattern Mining with a Progressive Database IEEE Transactions On Knowledge
and Data Engineering VOL. 20, NO. 9, SEPTEMBER 2008
[7] S Cong, J. Han and D. Padua,“Parallel Mining of Closed Sequential Patterns,” Proc. 11th ACM
SIGKDD Int’l Conf. Knowl- edge Discovery and Data Mining (KDD ’05), pp. 562-567, Aug. 2005.
[8] J. Wang and J. Han, “Bide: Efficient Mining of Frequent Closed Sequences,” Proc. 20th Int’l
Conf. Data Eng. (ICDE ’04), pp. 79-91, 2004.
[9] X. Yan, J. Han, and R. Afshar, “Clospan: Mining Closed Sequential Patterns in Large Datasets,”
Proc. Third SIAM Int’l Conf. Data Mining (SDM ’03), pp. 166-177, May 2003.
[10] M.N. Garofalakis, R. Rastogi, and K. Shim, “Spirit: Sequential Pattern Mining with Regular
Expression Constraints,” Proc. 25th Int’l Conf. Very Large Data Bases (VLDB ’99), pp. 223-234,
1999.
[11] C. Luo and S.M. Chung, “Efficient Mining of Maximal Sequential Patterns Using Multiple
Samples,” Proc. Fifth SIAM Int’l Conf. Data Mining (SDM), 2005.
[12] H.Cao, N.Mamoulis, and D.W. Cheung, “Mining Frequent Spatio-Temporal Sequential Patterns,”
Proc. Fifth Int’l Conf. Data Mining (ICDM ’05), pp. 82-89, Nov. 2005.
[13] J.-K. Guo, B.-J. Ruan, and Y.-Y. Zhu, “A Top-Down Algorithm for Web Log Sequential Pattern
Mining,” Proc. Ninth Pacific-Asia Conf. Knowledge Discovery and Data Mining (PAKDD), 2005.
[14] C.-C. Ho, H.-F. Li, F.-F. Kuo, and S.-Y. Lee, “Incremental Mining of Sequential Patterns over a
Stream Sliding Window,” Proc. IEEE Int’l Workshop Mining Evolving and Streaming Data
IWMESD ’06), Dec. 2006
Ad

More Related Content

What's hot (20)

Ontology based clustering algorithms
Ontology based clustering algorithmsOntology based clustering algorithms
Ontology based clustering algorithms
Ikutwa
 
Graph based Approach and Clustering of Patterns (GACP) for Sequential Pattern...
Graph based Approach and Clustering of Patterns (GACP) for Sequential Pattern...Graph based Approach and Clustering of Patterns (GACP) for Sequential Pattern...
Graph based Approach and Clustering of Patterns (GACP) for Sequential Pattern...
AshishDPatel1
 
Document Classification Using Expectation Maximization with Semi Supervised L...
Document Classification Using Expectation Maximization with Semi Supervised L...Document Classification Using Expectation Maximization with Semi Supervised L...
Document Classification Using Expectation Maximization with Semi Supervised L...
ijsc
 
Ir 03
Ir   03Ir   03
Ir 03
Mohammed Romi
 
Multidimensioal database
Multidimensioal  databaseMultidimensioal  database
Multidimensioal database
TPO TPO
 
Progressive Mining of Sequential Patterns Based on Single Constraint
Progressive Mining of Sequential Patterns Based on Single ConstraintProgressive Mining of Sequential Patterns Based on Single Constraint
Progressive Mining of Sequential Patterns Based on Single Constraint
TELKOMNIKA JOURNAL
 
Text mining
Text miningText mining
Text mining
Pankaj Thakur
 
Ceis 1
Ceis 1Ceis 1
Ceis 1
Alexander Decker
 
An efficient algorithm for sequence generation in data mining
An efficient algorithm for sequence generation in data miningAn efficient algorithm for sequence generation in data mining
An efficient algorithm for sequence generation in data mining
ijcisjournal
 
Probabilistic retrieval model
Probabilistic retrieval modelProbabilistic retrieval model
Probabilistic retrieval model
baradhimarch81
 
Mcs 021
Mcs 021Mcs 021
Mcs 021
Ujjwal Kumar
 
Sensing Trending Topics in Twitter for Greater Jakarta Area
Sensing Trending Topics in Twitter for Greater Jakarta Area Sensing Trending Topics in Twitter for Greater Jakarta Area
Sensing Trending Topics in Twitter for Greater Jakarta Area
IJECEIAES
 
P33077080
P33077080P33077080
P33077080
IJERA Editor
 
Mining Regular Patterns in Data Streams Using Vertical Format
Mining Regular Patterns in Data Streams Using Vertical FormatMining Regular Patterns in Data Streams Using Vertical Format
Mining Regular Patterns in Data Streams Using Vertical Format
CSCJournals
 
G04124041046
G04124041046G04124041046
G04124041046
IOSR-JEN
 
Mcs 021 solve assignment
Mcs 021 solve assignmentMcs 021 solve assignment
Mcs 021 solve assignment
Indira Gnadhi National Open University (IGNOU)
 
Bi4101343346
Bi4101343346Bi4101343346
Bi4101343346
IJERA Editor
 
Bc26354358
Bc26354358Bc26354358
Bc26354358
IJERA Editor
 
Probabilistic models (part 1)
Probabilistic models (part 1)Probabilistic models (part 1)
Probabilistic models (part 1)
KU Leuven
 
Different Similarity Measures for Text Classification Using Knn
Different Similarity Measures for Text Classification Using KnnDifferent Similarity Measures for Text Classification Using Knn
Different Similarity Measures for Text Classification Using Knn
IOSR Journals
 
Ontology based clustering algorithms
Ontology based clustering algorithmsOntology based clustering algorithms
Ontology based clustering algorithms
Ikutwa
 
Graph based Approach and Clustering of Patterns (GACP) for Sequential Pattern...
Graph based Approach and Clustering of Patterns (GACP) for Sequential Pattern...Graph based Approach and Clustering of Patterns (GACP) for Sequential Pattern...
Graph based Approach and Clustering of Patterns (GACP) for Sequential Pattern...
AshishDPatel1
 
Document Classification Using Expectation Maximization with Semi Supervised L...
Document Classification Using Expectation Maximization with Semi Supervised L...Document Classification Using Expectation Maximization with Semi Supervised L...
Document Classification Using Expectation Maximization with Semi Supervised L...
ijsc
 
Multidimensioal database
Multidimensioal  databaseMultidimensioal  database
Multidimensioal database
TPO TPO
 
Progressive Mining of Sequential Patterns Based on Single Constraint
Progressive Mining of Sequential Patterns Based on Single ConstraintProgressive Mining of Sequential Patterns Based on Single Constraint
Progressive Mining of Sequential Patterns Based on Single Constraint
TELKOMNIKA JOURNAL
 
An efficient algorithm for sequence generation in data mining
An efficient algorithm for sequence generation in data miningAn efficient algorithm for sequence generation in data mining
An efficient algorithm for sequence generation in data mining
ijcisjournal
 
Probabilistic retrieval model
Probabilistic retrieval modelProbabilistic retrieval model
Probabilistic retrieval model
baradhimarch81
 
Sensing Trending Topics in Twitter for Greater Jakarta Area
Sensing Trending Topics in Twitter for Greater Jakarta Area Sensing Trending Topics in Twitter for Greater Jakarta Area
Sensing Trending Topics in Twitter for Greater Jakarta Area
IJECEIAES
 
Mining Regular Patterns in Data Streams Using Vertical Format
Mining Regular Patterns in Data Streams Using Vertical FormatMining Regular Patterns in Data Streams Using Vertical Format
Mining Regular Patterns in Data Streams Using Vertical Format
CSCJournals
 
G04124041046
G04124041046G04124041046
G04124041046
IOSR-JEN
 
Probabilistic models (part 1)
Probabilistic models (part 1)Probabilistic models (part 1)
Probabilistic models (part 1)
KU Leuven
 
Different Similarity Measures for Text Classification Using Knn
Different Similarity Measures for Text Classification Using KnnDifferent Similarity Measures for Text Classification Using Knn
Different Similarity Measures for Text Classification Using Knn
IOSR Journals
 

Similar to MMP-TREE FOR SEQUENTIAL PATTERN MINING WITH MULTIPLE MINIMUM SUPPORTS IN PROGRESSIVE DATABASES (20)

H0964752
H0964752H0964752
H0964752
IOSR Journals
 
Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...
Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...
Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...
BRNSSPublicationHubI
 
Sequential Pattern Tree Mining
Sequential Pattern Tree MiningSequential Pattern Tree Mining
Sequential Pattern Tree Mining
IOSR Journals
 
Ijsrdv1 i2039
Ijsrdv1 i2039Ijsrdv1 i2039
Ijsrdv1 i2039
ijsrd.com
 
Ijariie1129
Ijariie1129Ijariie1129
Ijariie1129
IJARIIE JOURNAL
 
Implementation of Improved Apriori Algorithm on Large Dataset using Hadoop
Implementation of Improved Apriori Algorithm on Large Dataset using HadoopImplementation of Improved Apriori Algorithm on Large Dataset using Hadoop
Implementation of Improved Apriori Algorithm on Large Dataset using Hadoop
BRNSSPublicationHubI
 
A Survey of Sequential Rule Mining Techniques
A Survey of Sequential Rule Mining TechniquesA Survey of Sequential Rule Mining Techniques
A Survey of Sequential Rule Mining Techniques
ijsrd.com
 
SEARCH OF INFORMATION BASED CONTENT IN SEMI-STRUCTURED DOCUMENTS USING INTERF...
SEARCH OF INFORMATION BASED CONTENT IN SEMI-STRUCTURED DOCUMENTS USING INTERF...SEARCH OF INFORMATION BASED CONTENT IN SEMI-STRUCTURED DOCUMENTS USING INTERF...
SEARCH OF INFORMATION BASED CONTENT IN SEMI-STRUCTURED DOCUMENTS USING INTERF...
ijcsitcejournal
 
Mining closed sequential patterns in large sequence databases
Mining closed sequential patterns in large sequence databasesMining closed sequential patterns in large sequence databases
Mining closed sequential patterns in large sequence databases
IJDMS
 
Drsp dimension reduction for similarity matching and pruning of time series ...
Drsp  dimension reduction for similarity matching and pruning of time series ...Drsp  dimension reduction for similarity matching and pruning of time series ...
Drsp dimension reduction for similarity matching and pruning of time series ...
IJDKP
 
Review Over Sequential Rule Mining
Review Over Sequential Rule MiningReview Over Sequential Rule Mining
Review Over Sequential Rule Mining
ijsrd.com
 
Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay...
Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay...Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay...
Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay...
Editor IJCATR
 
A Brief Overview On Frequent Pattern Mining Algorithms
A Brief Overview On Frequent Pattern Mining AlgorithmsA Brief Overview On Frequent Pattern Mining Algorithms
A Brief Overview On Frequent Pattern Mining Algorithms
Sara Alvarez
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
IJERD Editor
 
E43022023
E43022023E43022023
E43022023
IJERA Editor
 
Modern association rule mining methods
Modern association rule mining methodsModern association rule mining methods
Modern association rule mining methods
ijcsity
 
Mining sequential patterns for interval based
Mining sequential patterns for interval basedMining sequential patterns for interval based
Mining sequential patterns for interval based
ijcsa
 
Efficient Parallel Pruning of Associative Rules with Optimized Search
Efficient Parallel Pruning of Associative Rules with Optimized  SearchEfficient Parallel Pruning of Associative Rules with Optimized  Search
Efficient Parallel Pruning of Associative Rules with Optimized Search
IOSR Journals
 
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
IRJET Journal
 
PSF_Introduction_to_R_Package_for_Pattern_Sequence (1)
PSF_Introduction_to_R_Package_for_Pattern_Sequence (1)PSF_Introduction_to_R_Package_for_Pattern_Sequence (1)
PSF_Introduction_to_R_Package_for_Pattern_Sequence (1)
neeraj7svp
 
Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...
Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...
Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...
BRNSSPublicationHubI
 
Sequential Pattern Tree Mining
Sequential Pattern Tree MiningSequential Pattern Tree Mining
Sequential Pattern Tree Mining
IOSR Journals
 
Ijsrdv1 i2039
Ijsrdv1 i2039Ijsrdv1 i2039
Ijsrdv1 i2039
ijsrd.com
 
Implementation of Improved Apriori Algorithm on Large Dataset using Hadoop
Implementation of Improved Apriori Algorithm on Large Dataset using HadoopImplementation of Improved Apriori Algorithm on Large Dataset using Hadoop
Implementation of Improved Apriori Algorithm on Large Dataset using Hadoop
BRNSSPublicationHubI
 
A Survey of Sequential Rule Mining Techniques
A Survey of Sequential Rule Mining TechniquesA Survey of Sequential Rule Mining Techniques
A Survey of Sequential Rule Mining Techniques
ijsrd.com
 
SEARCH OF INFORMATION BASED CONTENT IN SEMI-STRUCTURED DOCUMENTS USING INTERF...
SEARCH OF INFORMATION BASED CONTENT IN SEMI-STRUCTURED DOCUMENTS USING INTERF...SEARCH OF INFORMATION BASED CONTENT IN SEMI-STRUCTURED DOCUMENTS USING INTERF...
SEARCH OF INFORMATION BASED CONTENT IN SEMI-STRUCTURED DOCUMENTS USING INTERF...
ijcsitcejournal
 
Mining closed sequential patterns in large sequence databases
Mining closed sequential patterns in large sequence databasesMining closed sequential patterns in large sequence databases
Mining closed sequential patterns in large sequence databases
IJDMS
 
Drsp dimension reduction for similarity matching and pruning of time series ...
Drsp  dimension reduction for similarity matching and pruning of time series ...Drsp  dimension reduction for similarity matching and pruning of time series ...
Drsp dimension reduction for similarity matching and pruning of time series ...
IJDKP
 
Review Over Sequential Rule Mining
Review Over Sequential Rule MiningReview Over Sequential Rule Mining
Review Over Sequential Rule Mining
ijsrd.com
 
Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay...
Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay...Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay...
Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay...
Editor IJCATR
 
A Brief Overview On Frequent Pattern Mining Algorithms
A Brief Overview On Frequent Pattern Mining AlgorithmsA Brief Overview On Frequent Pattern Mining Algorithms
A Brief Overview On Frequent Pattern Mining Algorithms
Sara Alvarez
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
IJERD Editor
 
Modern association rule mining methods
Modern association rule mining methodsModern association rule mining methods
Modern association rule mining methods
ijcsity
 
Mining sequential patterns for interval based
Mining sequential patterns for interval basedMining sequential patterns for interval based
Mining sequential patterns for interval based
ijcsa
 
Efficient Parallel Pruning of Associative Rules with Optimized Search
Efficient Parallel Pruning of Associative Rules with Optimized  SearchEfficient Parallel Pruning of Associative Rules with Optimized  Search
Efficient Parallel Pruning of Associative Rules with Optimized Search
IOSR Journals
 
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
IRJET Journal
 
PSF_Introduction_to_R_Package_for_Pattern_Sequence (1)
PSF_Introduction_to_R_Package_for_Pattern_Sequence (1)PSF_Introduction_to_R_Package_for_Pattern_Sequence (1)
PSF_Introduction_to_R_Package_for_Pattern_Sequence (1)
neeraj7svp
 
Ad

Recently uploaded (20)

Deepfake Phishing: A New Frontier in Cyber Threats
Deepfake Phishing: A New Frontier in Cyber ThreatsDeepfake Phishing: A New Frontier in Cyber Threats
Deepfake Phishing: A New Frontier in Cyber Threats
RaviKumar256934
 
Personal Protective Efsgfgsffquipment.ppt
Personal Protective Efsgfgsffquipment.pptPersonal Protective Efsgfgsffquipment.ppt
Personal Protective Efsgfgsffquipment.ppt
ganjangbegu579
 
Environment .................................
Environment .................................Environment .................................
Environment .................................
shadyozq9
 
OPTIMIZING DATA INTEROPERABILITY IN AGILE ORGANIZATIONS: INTEGRATING NONAKA’S...
OPTIMIZING DATA INTEROPERABILITY IN AGILE ORGANIZATIONS: INTEGRATING NONAKA’S...OPTIMIZING DATA INTEROPERABILITY IN AGILE ORGANIZATIONS: INTEGRATING NONAKA’S...
OPTIMIZING DATA INTEROPERABILITY IN AGILE ORGANIZATIONS: INTEGRATING NONAKA’S...
ijdmsjournal
 
Generative AI & Large Language Models Agents
Generative AI & Large Language Models AgentsGenerative AI & Large Language Models Agents
Generative AI & Large Language Models Agents
aasgharbee22seecs
 
Using the Artificial Neural Network to Predict the Axial Strength and Strain ...
Using the Artificial Neural Network to Predict the Axial Strength and Strain ...Using the Artificial Neural Network to Predict the Axial Strength and Strain ...
Using the Artificial Neural Network to Predict the Axial Strength and Strain ...
Journal of Soft Computing in Civil Engineering
 
Optimizing Reinforced Concrete Cantilever Retaining Walls Using Gases Brownia...
Optimizing Reinforced Concrete Cantilever Retaining Walls Using Gases Brownia...Optimizing Reinforced Concrete Cantilever Retaining Walls Using Gases Brownia...
Optimizing Reinforced Concrete Cantilever Retaining Walls Using Gases Brownia...
Journal of Soft Computing in Civil Engineering
 
acid base ppt and their specific application in food
acid base ppt and their specific application in foodacid base ppt and their specific application in food
acid base ppt and their specific application in food
Fatehatun Noor
 
AI Chatbots & Software Development Teams
AI Chatbots & Software Development TeamsAI Chatbots & Software Development Teams
AI Chatbots & Software Development Teams
Joe Krall
 
Applications of Centroid in Structural Engineering
Applications of Centroid in Structural EngineeringApplications of Centroid in Structural Engineering
Applications of Centroid in Structural Engineering
suvrojyotihalder2006
 
twin tower attack 2001 new york city
twin  tower  attack  2001 new  york citytwin  tower  attack  2001 new  york city
twin tower attack 2001 new york city
harishreemavs
 
introduction technology technology tec.pptx
introduction technology technology tec.pptxintroduction technology technology tec.pptx
introduction technology technology tec.pptx
Iftikhar70
 
Water Industry Process Automation & Control Monthly May 2025
Water Industry Process Automation & Control Monthly May 2025Water Industry Process Automation & Control Monthly May 2025
Water Industry Process Automation & Control Monthly May 2025
Water Industry Process Automation & Control
 
Design Optimization of Reinforced Concrete Waffle Slab Using Genetic Algorithm
Design Optimization of Reinforced Concrete Waffle Slab Using Genetic AlgorithmDesign Optimization of Reinforced Concrete Waffle Slab Using Genetic Algorithm
Design Optimization of Reinforced Concrete Waffle Slab Using Genetic Algorithm
Journal of Soft Computing in Civil Engineering
 
Working with USDOT UTCs: From Conception to Implementation
Working with USDOT UTCs: From Conception to ImplementationWorking with USDOT UTCs: From Conception to Implementation
Working with USDOT UTCs: From Conception to Implementation
Alabama Transportation Assistance Program
 
Slide share PPT of SOx control technologies.pptx
Slide share PPT of SOx control technologies.pptxSlide share PPT of SOx control technologies.pptx
Slide share PPT of SOx control technologies.pptx
vvsasane
 
Frontend Architecture Diagram/Guide For Frontend Engineers
Frontend Architecture Diagram/Guide For Frontend EngineersFrontend Architecture Diagram/Guide For Frontend Engineers
Frontend Architecture Diagram/Guide For Frontend Engineers
Michael Hertzberg
 
David Boutry - Specializes In AWS, Microservices And Python.pdf
David Boutry - Specializes In AWS, Microservices And Python.pdfDavid Boutry - Specializes In AWS, Microservices And Python.pdf
David Boutry - Specializes In AWS, Microservices And Python.pdf
David Boutry
 
Machine Learning basics POWERPOINT PRESENETATION
Machine Learning basics POWERPOINT PRESENETATIONMachine Learning basics POWERPOINT PRESENETATION
Machine Learning basics POWERPOINT PRESENETATION
DarrinBright1
 
Automatic Quality Assessment for Speech and Beyond
Automatic Quality Assessment for Speech and BeyondAutomatic Quality Assessment for Speech and Beyond
Automatic Quality Assessment for Speech and Beyond
NU_I_TODALAB
 
Deepfake Phishing: A New Frontier in Cyber Threats
Deepfake Phishing: A New Frontier in Cyber ThreatsDeepfake Phishing: A New Frontier in Cyber Threats
Deepfake Phishing: A New Frontier in Cyber Threats
RaviKumar256934
 
Personal Protective Efsgfgsffquipment.ppt
Personal Protective Efsgfgsffquipment.pptPersonal Protective Efsgfgsffquipment.ppt
Personal Protective Efsgfgsffquipment.ppt
ganjangbegu579
 
Environment .................................
Environment .................................Environment .................................
Environment .................................
shadyozq9
 
OPTIMIZING DATA INTEROPERABILITY IN AGILE ORGANIZATIONS: INTEGRATING NONAKA’S...
OPTIMIZING DATA INTEROPERABILITY IN AGILE ORGANIZATIONS: INTEGRATING NONAKA’S...OPTIMIZING DATA INTEROPERABILITY IN AGILE ORGANIZATIONS: INTEGRATING NONAKA’S...
OPTIMIZING DATA INTEROPERABILITY IN AGILE ORGANIZATIONS: INTEGRATING NONAKA’S...
ijdmsjournal
 
Generative AI & Large Language Models Agents
Generative AI & Large Language Models AgentsGenerative AI & Large Language Models Agents
Generative AI & Large Language Models Agents
aasgharbee22seecs
 
acid base ppt and their specific application in food
acid base ppt and their specific application in foodacid base ppt and their specific application in food
acid base ppt and their specific application in food
Fatehatun Noor
 
AI Chatbots & Software Development Teams
AI Chatbots & Software Development TeamsAI Chatbots & Software Development Teams
AI Chatbots & Software Development Teams
Joe Krall
 
Applications of Centroid in Structural Engineering
Applications of Centroid in Structural EngineeringApplications of Centroid in Structural Engineering
Applications of Centroid in Structural Engineering
suvrojyotihalder2006
 
twin tower attack 2001 new york city
twin  tower  attack  2001 new  york citytwin  tower  attack  2001 new  york city
twin tower attack 2001 new york city
harishreemavs
 
introduction technology technology tec.pptx
introduction technology technology tec.pptxintroduction technology technology tec.pptx
introduction technology technology tec.pptx
Iftikhar70
 
Slide share PPT of SOx control technologies.pptx
Slide share PPT of SOx control technologies.pptxSlide share PPT of SOx control technologies.pptx
Slide share PPT of SOx control technologies.pptx
vvsasane
 
Frontend Architecture Diagram/Guide For Frontend Engineers
Frontend Architecture Diagram/Guide For Frontend EngineersFrontend Architecture Diagram/Guide For Frontend Engineers
Frontend Architecture Diagram/Guide For Frontend Engineers
Michael Hertzberg
 
David Boutry - Specializes In AWS, Microservices And Python.pdf
David Boutry - Specializes In AWS, Microservices And Python.pdfDavid Boutry - Specializes In AWS, Microservices And Python.pdf
David Boutry - Specializes In AWS, Microservices And Python.pdf
David Boutry
 
Machine Learning basics POWERPOINT PRESENETATION
Machine Learning basics POWERPOINT PRESENETATIONMachine Learning basics POWERPOINT PRESENETATION
Machine Learning basics POWERPOINT PRESENETATION
DarrinBright1
 
Automatic Quality Assessment for Speech and Beyond
Automatic Quality Assessment for Speech and BeyondAutomatic Quality Assessment for Speech and Beyond
Automatic Quality Assessment for Speech and Beyond
NU_I_TODALAB
 
Ad

MMP-TREE FOR SEQUENTIAL PATTERN MINING WITH MULTIPLE MINIMUM SUPPORTS IN PROGRESSIVE DATABASES

  • 1. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.5, October 2012 DOI : 10.5121/ijcsea.2012.2506 61 MMP-TREE FOR SEQUENTIAL PATTERN MINING WITH MULTIPLE MINIMUM SUPPORTS IN PROGRESSIVE DATABASES K.M.V.Madan Kumar1 , P.V.S.Srinivas2 and C.Raghavendra Rao3 1 Research Scholar, CMJ University, Shillong, Meghalaya, India Assoc Professor, TKR College of Engg &Tech, Hyderabad, AP, India 2 Professor, Geethanjali College of Engineering & Technology, Hyderabad, India 3 Professor, Dept. of CIS, University of Hyderabad, Hyderabad, India ABSTRACT Although there may be lot of research work done on sequential pattern mining in static, incremental, progressive databases, the previous work do not fully concentrating on support issues. Most of the previous approaches set a single minimum support threshold for all the items or item sets. But in real world applications different items may have different support threshold to describe whether a given item or item set is a frequent item set. This means each item will contain its own support threshold depends upon various issues like cost of item, environmental factors etc. In this work we proposed a new approach which can be applied on any algorithm independent of that whether the particular algorithm may or may not use the process of generating the candidate sets for identifying the frequent item sets. The proposed algorithm will use the concept of “percentage of participation” instead of occurrence frequency for every possible combination of items or item sets. The concept of percentage of participation will be calculated based on the minimum support threshold for each item set. Our algorithm MS-PISA , which stands for Multiple Support Progressive mIning Of Sequential pAtterns , which discovers sequential patterns in by considering different multiple minimum support threshold values for every possible combinations of item or item sets. KEYWORDS Multiple minimum supports, Progressive sequential pattern, Percentage of participation 1. INTRODUCTION Sequential Pattern mining is one of the most important research issues in data mining. Which was first introduced by Agarwal and Srikanth [2] and can be described as follows: we are given a set of data sequences which will be used as input data. Each sequence consists of a list of ordered item sets containing a set of different items. The sequential pattern mining finds all Sub sequences with frequencies lower than min support which is given by user. After that there have been a significant research work was done by various researchers on sequential pattern mining and defined number of algorithms not only for static data base [2],[3] Where data do not change over time but also for incremental data bases[6] where there will be new data arriving as the time goes by. In addition to above researchers some other researchers derived other kind of sequential patterns like closed sequential patterns. Constraint sequential pattern, maximal sequential patterns, spatio temporal sequential patterns. Sequential pattern on specific type of data, on stream data.
  • 2. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.5, October 2012 62 The sequential patterns mining in progressive data bases will be having a vide use in many applications. For example in mobile applications, share market. Moreover the stock price changes of the company before five years ago may have very little influence on the quotes of other stocks. To remedy the problem of sequential pattern mining in progressive databases, Haung.et.al [6] proposed an algorithm PISA which takes the concept of period of Interest (POI) but Haung.et.al[6] also considered only uniform min.sup like all the previous researchers. But considering uniform min.sup, implicitly assumes that all items in the data base have similar frequency. However some items may appear very frequently in the data base while other rarely appears. Under such circumstances, if we set the value of min support too high we will not find those rules involving rare items in the data base. On the other hand if we set that value too low, it will generate huge amounts of meaningless patterns. Therefore liue .et.al [4] first address this and propose the concept of multiple minimum supports (MMS in Short)in association rule mining. In this study we first extend the definition of sequential pattern by considering the concept of MMS, which allows users to specify multiple min.sup for each item. Items with low frequencies will be specified with lower minimum support and pattern involving these items can be easily retrieved for further decision support. But the major problem on frequent pattern mining with MMS is that the downward closure property no longer holds in the mining process, which means that a super-pattern of an infrequent pattern might be frequent pattern. To effectively reduce the search space in a level –wise methods, Liu.et.al proposes the sorted closure property, where all item in data sets are sorted in ascending order by their MIS values. The sorted closure property however is invalid in sequential pattern mining since the order in the data sequences cannot be altered. Therefore to discover complete set of sequential patterns with MMS is not straight forward. Based on the new definitions of sequential patterns with MMS, we first proposed the concept called Percentage of participation (POP) where POP is the percentage of participation of the item or item set with respect to the no of sequences in the POI i.e. |Db| and minimum item support (MIS). The structure of this paper as follows. In section 2, we first review the concept of PISA as the basis of our approach. Section 3 introduces the definition of the progressive sequential pattern mining with multiple minimum supports (MMS) by involving the Percentage of participation (POP).Our algorithm called MS-PISA will be discussed in section 4 and we have conclusion in section 5. 2. RELATED WORK Haung.et.al proposed algorithm which works for mining of sequential patterns in progressive data bases. The input of progressive sequential pattern mining is a user specified length of POI and user-defined minimum support threshold. POI is a sliding window protocol whose length is a user specified time interval continuously advances as the time advances. The sequences having elements whose time stamps are within the POI will be considered to contribute for |Db| for current sequential patterns (where |Db| is the number of sequences in the POI). The sequences having elements with time stamps older than the POI will be considered as absolute and removed. In this they consider an element which is set of items appeared in a sequence with the time stamp denoted by p, q. When the sequence database is denoted by Db then a sequence in a sequence database is a set of elements ordered by time stamp increasingly. They also introduced a time interval denoted by [p, q] to represent the time interval between time stamp p and q. Therefore Dbp,q is defined as a subset of the databases Db containing the elements of sequences from time stamp p to time stamp q. They also aimed finding the sequential patterns containing no repeated elements and having more than one element. In the beginning, the mining algorithm receives new data of different timestamps with a user-defined minimum support threshold and the length of
  • 3. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.5, October 2012 63 POI. For each POI, the mining algorithm progressively updates the sequences in the database. Then, the algorithm outputs frequent sequential patterns which are qualified by the uniform minimum support threshold. Note that the mining algorithm should prune away obsolete data from the sequence database, delete obsolete sequential patterns rapidly, and update |Db| with the number of sequences which have elements in the urrent POI. The process will be continuously executed until there is no more newly arriving data. To solve the progressive sequential pattern mining algorithm, Haung.et.al proposed a progressive mining algorithm PISA. Where PISA maintains a PS-tree to keep the information of the progressive database and up-to-date sequential patterns in each POI. The nodes in PS-tree can be divided into two different types. They are root node and common nodes. Root node is the root of PS-tree containing nothing but a list of common nodes as its children. Each common node stores two information, say node label and a sequence list, as shown in the top-left side of Fig. 2. The label is the same as the element in a sequence. The sequence list stores a list of sequence IDs to represent the sequences containing this element. Each sequence ID in the sequence list is marked by a corresponding timestamp. Whenever there are a series of elements appearing in the same sequence, there will be a series of nodes labeled by each element, respectively, with the same sequence IDs in their sequence lists. Then, the first node will be connected to the root node and the second node representing the following element will be connected to the first node. The other nodes will be connected analogously. Note that, in such a way, the path from root node to any other node will represent the candidate sequential pattern appearing in this sequence. The appearing timestamp for each candidate sequential pattern will be marked in the node labeled by the last element. If there is another sequence having the same pattern, the sequence ID will be inserted into the sequence lists of the same nodes labeled by these elements on the path. On the other hand, if an element appearing in a sequence is obsolete, the corresponding sequence ID will be removed from the sequence list of the node. In addition, if a node has no sequence in the sequence list, it will be pruned away from PS-tree. Figure 1 For the above example shown in figure1 the algorithm PISA constructs the candidate sets as follows which is shown in figure2.
  • 4. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.5, October 2012 64 Figure 2 Before this many researchers have developed various methods to find frequent sequential patterns with a static database. Apriori All [2], GSP [3] are the milestones of sequential pattern mining algorithms based on traditional association rules mining technique. SPADE [5], illustrated by Zaki, systematically searched the sequence lattice spanned by the subsequence relation. Han et al. and Pie et al. brought up Free Span and Prefix Span, which found sequential patterns by constructing sub databases of the entire database. Zhang et al. proposed two algorithms GSP+ and MFS+ based on static algorithms GSP and MFS (also derived by the same authors).But as discussed early in section1 all the researchers considered uniform min.sup to describe whether the sequence is frequent or not.
  • 5. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.5, October 2012 65 3. PROBLEM DEFINITION In this section, we formally give definitions used in the discovery of sequential pattern with MMS. Let I denote the set of items in the database, and a subset of I is called an item set. A customer’s data-sequence is an ordered list of items with time stamps. Therefore, a sequence, say α, can be represented as <(a1: t1), (a2: t2), (a3: t3), …, (an: tn)>, where a j is an item, and tj stands for the time when aj occurs, 1≤ j≤n, and tj-1≤tj for 2≤j≤ n. If several items occur at the same time in tj for 2 j the sequence, they are ordered alphabetically. Definition 1. Given an item set Iq = (i1, i2...im), we say item set Iq occurs in α if integers 1≤k1 < k2 <…< km ≤n exist such that, i1 = ak1, i2 = ak2,... ,im = akm and tk1= tk2=...= tkm. We refer to k1 and tk1 jointly as the position and the time that Iq occurs in α, respectively. Definition 2. Let β = <I1, I2...Is> (Iq subset or equal to I for 1 ≤q≤ s) be a sequence of item set. Assume that each Iq in β occurs in α. Then we say sequence β occurs in α, or is a subsequence of α if tI1 < tI2<...< tIs, where tIq (1≤ q ≤s) is the time, at which Iq occurs in α. Definition 3. A sequence database S is formed by a set of records <sid, s >, where sid is the identifier of this data-sequence and s is a data-sequence. For a given sequence β, the support count of sequence β in S are defined as follows: supp (β) = |{(sid, s)| (sid, s) ЄS ^β is a subsequence in s}| The following definitions are related to the concept of MMS. In this model, the definition of the minimum support is changed. Each item in the database can have its minsup, which is expressed in terms of minimum item support (MIS). In other words, users can specify different MIS values for different items. Definition 4: Let MIS (i) denote the MIS value of item i (i subset or equal to I). Given an item set Iq = (i1,i2,...,im), the MIS value of item set Iq = (i1,i2,...,im) (1≤ k ≤m), denoted as MIS(Iq), is equal to: min [MIS (i1), MIS(i2),…, MIS(im)] Definition 5: Given a sequence β = <I1.I2...Is> (Iq subset or equal to I for 1≤ q≤ s), the minimum support threshold of β, denoted as MIS (β), is equal to: min [ MIS(I1), MIS(I2),…, MIS(Is)] Definition 6: Given a sequence database S and a sequence β, we call β is frequent in S or β is a sequential pattern in S if supp (β) ≥MIS (β). Definition 7: Progressive sequential pattern mining problem. “Given a user-specified length of POI and a user- defined minimum support threshold, find the complete set of frequent subsequences whose occurrence frequencies are greater than or equal to the minimum support times the number of sequences in the recent POI of a progressive database.”
  • 6. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.5, October 2012 66 Definition 8: Let I be an item or item set and MIS (I) be the minimum item support threshold of item I. |Dbp,q | is the number of sequences In the POI. Timestamps between p ,q. Percentage Of Participation(POP)= 100/ |Dbp,q |* MIS(I) With the provision of different minimum item support thresholds for different items, user can effectively express different support requirements for different data-sequences. MMS allow users to have higher minimum supports for sequences involving high frequent items, and also allows us to have lower minimum supports for sequences involving rare items. Given a sequence Database S and a set of MIS values for all items in S, we discover all sequential patterns that satisfy MIS (β). 4. THE PROPOSED ALGORITHM Progressive Mining of sequential patterns with multiple minimum supports For mining of sequential patterns in progressive databases with multiple minimum supports, we propose another algorithm called MS-PISA. The algorithm maintains a tree based data structure called MMP-tree which will store the knowledge of the database, where the database is progressively incremented as the new transactions has been taken place. The tree also maintains the percentage of participation of each element whenever the element is formed. Here we will first discuss about MMP-tree in section 4.1.Next we explained in detail about MS-PISA in section 4.2.In section 4.3 we have taken an example database and discussed very clearly how the algorithm MS-PISA works on the example. 4.1 MMP-tree Here the MMP-tree will act as a basic kernel of MS-PISA. It not only stores the knowledge of all the sequences in the database which progressively updates and also helps the algorithm to gather frequent sequential patterns according to the multiple minimum supports in each POI. The nodes in MMP-tree can be categorized in to two different categories. One is the root node and another is common node. Root node is root of the MMP tree which does not stores any other information but attached to common nodes as its children. Each common node gathers the knowledge about three issues, one node label, second the sequence list of itemset, third the percentage of participation of the itemset according to the minimum support of itemset. Here the label is the same as the element in a sequence. The sequence list is the ID of the sequence that contains the element to represent the sequence. The sequence ID in the sequence list will be attached to the corresponding time stamp to help in deleting the obsolete data. It is worth to mention that only the nodes in the first and the second levels have to maintain the corresponding time stamps for the sequence IDs. The time stamps for the sequence IDs in the nodes below the third level are the same as the time stamps for the same sequence IDs in the second level. Therefore it is not necessary to store the duplicate information. Third along with label, sequence ID the MMP-tree also stores the information about the percentage of participation of the element according to the minimum support of the element. For example if the element is AB then the tree finds out which item in the element is having less user defined minimum support. If A is having less minimum support, then the percentage of participation will be calculated with respect to A by using the formula POP=100/|Dbp,q | * MIS(A) where |Dbp,q | is no of sequences in the time slot from time stamp p to time stamp q. When ever in a particular sequence, if the element is appearing in a series wise then series of nodes labeled by each element along with the percentage of participation will be gathered respectively. The sequence ID will be same the sequence list. In the process of forming of the
  • 7. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.5, October 2012 67 MMP tree the first node will be connected to the root node with the element label and percentage of participation of the element with respect to minimum support of the element. The second node will be representing the following element along with the percentage of participation of that particular element with respect to the minimum support of that element. The second node will be attached to the first node which is already formed. The other nodes will be formed with corresponding elements and their percentage of participation analogously. Note that in such a way the path from root node to any other node will be representing the candidate sequential pattern appearing in this sequence. The appearing time stamp for each candidate sequential pattern will be marked in the node labeled by last element. If there is another sequence having the same pattern, the sequence ID will be inserted into the sequence lists of the same nodes labeled by these elements on the path along with the percentage of participation of the element. On the other hand, if an element appearing in a sequence is obsolete, the corresponding sequence ID will be removed from the sequence list of the node. When an obsolete node is removed the percentage of participation of the element also will be removed. In addition, if a node has no sequence in the sequence list, it will be pruned way from MMP tree. Thus we can ensure that there are only up-to-date candidate sequential patterns in MMP tree. Now we will consider those elements whose percentage of participation in more than 100% in total as frequent sequential patterns from the above formed candidate sequential patterns, now we will demonstrate how to maintain MMP-Tree in detail in the following section i.e. 1.2. 4.2 Algorithm MS-PISA The main process in MS-PISA is to progressively update the knowledge of each sequence and accumulate the percentage of participation of each element in the sequence. To construct the candidate sequential pattern in progressive data base, the MS-PISA will use the MMP-Tree. For this, the MMP-Tree mainly stores all the sequences along with the percentage of participation corresponding to the minimum support of element that are falling in the sequences from one POI to another. When receiving the elements at the arriving time stamp, say t+1, MS-PISA constructs the original MMP-Tree of time stamp t in post order (children first, then the node itself) by accumulating the POP and updates the MMP-Tree of time stamp t. In the process of MMP-Tree, MS-PISA do the following things: 1. Deletes the obsolete elements along with the corresponding percentage of participation from. 2. Update the current sequences with the POP. 3. Calculate the POP of the new arriving elements and inserts them in to MMP-Tree of time stamp t. The detailed algorithm for POP is shown in the figure 3(a) and for MS-PISA in the figure 3(b), 3(c). In figure 3(b) from line 4 to line 7 MS-PISA collects the elements of all sequences along with the POP of the elements at the running time stamps and constructs the MMP-Tree. Then MS-PISA travels in forward direction to the next time stamp and collets the arriving elements in the sequences along with the POP of the elements. This process continuous until there is no further newly arriving elements in a progressive data base. The main procedure constructs is used to construct the MMP-Tree of time stamp t and convert it to the new MMP-Tree of time stamp t+1 along with the POP of the elements. The procedure construct is shown in the fig. 3(c) and it was explained in detail with an example in fig.4 the procedure construct MMP-Tree is in post order. That is MS-PISA goes to the children node and deals with them first. After all the children nodes are constructed there MS-PISA constructs to the node itself. The basic concept of procedure construct is to attach each newly arriving element of all the sequences along with percentage of participation of the each arriving element with respect to the minimum support of the element in to the MMP-Tree. There are two kinds of
  • 8. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.5, October 2012 68 nodes in MMP-Tree say root node and children nodes. For root node MS-PISA examines all sequences which have new elements and calculate the percentage of participation of the element with the formula 100 / |Dbp,q |*MIS(el) where el :element. As shown in figure 4(t1), the sequences S01, S02, S03 consisting of elements A,(AD) and A. The MS-PISA deals with root node first from line 2 to line 11 shown in figure 3(c). If the node is root MS-PISA examines all newly arriving elements in the eleSet including the POP from line 3 to line 11.MS-PISA has to create all the combinations of candidate elements and calculate the POP of the combinations in newly arriving data. For example if the arriving data contains (ABC), then the combinations will be A, B, C, (AB), (AC), (BC) and (ABC) with their corresponding POP. For each element, if the element already appeared in the last time stamp then MS-PISA checks for the sequence ID was existed in that child node from Line 6 to 9.If it is existed in the list then MS-PISA updates the time stamp of that sequence to new time stamp and calculate the POP of the element by considering the MIS (el).Otherwise MS-PISA creates a new sequence with POP of the elements and inserts the sequence in to the list of the sequence child. If there is no child node is available with the same label, MS-PISA creates new child node with the corresponding sequence ID and calculate the POP of the elements which are present in the node in line 11. As shown in figure 4(t2) at node labeled by A ,MS-PISA checks the list of sequences of that node to find if there is any newly arriving element of the sequences ,say 01,02 and 03.The corresponding details are shown in figure 3(c) from line 12 to 27.If the upcoming node is not the root node , MS-PISA first deletes the sequences as obsolete from the list of sequences along with their corresponding POP from line 14 to 15.If there is no ID left in the list MS-PISA prunes way the node along with the POP from parent and goes to next node from line 26 to 27.The MS-PISA the left out sequences in the sequence list from line 16 to line 25 of figure 3(c). If there is a newly arriving element in eleset, for all combinations of candidate elements in the arriving data MS- PISA checks whether the element is on the path from root node in line 18.If it is not, that means that there is new candidate sequential pattern. Then MS-PISA follows the procedure from line 19 to 25 and creates new sequences or Childs along with the percentage of participation of elements same as from line 5 to 11.From line 26 to 27 we can note that if there is no sequence in a node that node and it’s children along with POP should be deleted. After processing all the common nodes (which means the process of constructing candidate sequential patterns was completed) then MS-PISA examines the sequence ID’s for frequent sequential patterns by considering the total POP of the nodes (including from root to child) should be equal to 100 or more than 100.Now MS-PISA will give the path from root to child as the output from line 28 to 29 of figure 3(c).In this way MS-PISA can find out all up-to-date frequent sequential patterns of time stamp t+1after constructing the original MMP-tree of time stamp t. The process of calculating the POP will be as follows. As shown in figure 3(a) the algorithm POP will check the elements from newly arriving data. If the element is having more than one item then it will identify the item which is having less MIS value as shown from line 3 to 4 in figure 3(a). Then it will calculate the corresponding POP of the element described from line 6 to8 of figure 3(a). Consider the example shown in the figure 1 and the MIS values of A, B, C and D are 0.8, 0.5, 0.3 and 0.2 consecutively. The POP of A, B, C and D for 5 sequences are 25, 40, 66, and 100. The figures 4 (a), 4(b), 4(c), 4(d), 4(e), 4(f), 4(g), 4(h) and 4(i) are the MMP-Trees of the example sequence database shown in figure 1. In the same figures we have shown the frequent sequences. The process was explained in detail in the next section.
  • 9. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.5, October 2012 69 Algorithm POP (Items, MIS(I)) 1. var POP; 2. var I, Imin;// Items 3. For (all combinations of items in the ele); 4. If (No of items >1); 5. Check and identify the item Imin having less support; 6. POP (ele) =100/|Dbp,q |*MIS(Imin): // |Dbp,q | is no of sequences from p to q 7. else 8. POP (ele)=100/|Dbp,q |*MIS(I): // |Dbp,q | is no of sequences from p to q END Figure 3(a) Algorithm MS-Pisa (POI) 1. var PS; // PS-Tree 2. var current Time; // timestamp now 3. var eleSet; // used to store elements ele 4. while (there is still new transaction) 5. eleSet = read all ele at current Time; 6. construct (current Time, PT); 7. current Time++ END Figure 3(b) Algorithm construct (current time, PS) 1. for (each node of PS in post order) 2. if (node is null) 3. for (ele of every seq in eleSet) 4. for (all combination of elements in the ele) 5. if (element==label of one of ele.child) 6. if (seq is in node.child.seq_list) 7. update timestamp of seq to currentTime, POP (ele); 8. else 9. create a new sequence with current Time, POP(ele); 10. else 11. create a new child with element, current Time, POP(ele); 12. else for (every seq in the seq_list) 13. if (seq.timestamp<= currentTime_POI) 14. delete seq, POP (el) from seq_list and continue to next seq; 15. if (there is new ele of the seq in eleSet) 16. for (all combination of elements in the ele) 18. if (element is not in the path from Root) 19. if (element==label of one of node.child) 20. if (seq is in child.seq_list) 21. child.seq_list.seq.timestamp= seq.timestamp; 22. else 23. create a new sequence with seq.timestamp, POP (ele); 24. else 25. create a new child with element, seq.timestamp, POP (ele); 28. if (seq_list.Total POP (el)>=100%) 29. output the ele as a sequential pattern; END Figure 3(c)
  • 10. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.5, October 2012 70 Figure 4(a) Figure 4(b) Figure 4(c)
  • 11. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.5, October 2012 71 5.RESULT Our proposed algorithm is very effectively working for test dataset and we have analyzed the test dataset for different parameters like period of interest, execution time etc. 5.1 Impact of period of interest (POI) on execution time Execution time is the time required to execute all the instructions in the proposed algorithm. Here we can observe that the execution time will increase a little bit with respect to the time taken for identifying the item having minimum MIS value. For this we applied the quick sorting technique .It can be noted that execution time is directly proportional to POI. The reason is that, the increase in POI required more time for process as the no of new elements will be added to the existing one. Figure 5(a) shows the impact of POI over the execution time. Figure 5(a) 5.2 Impact of POI over number of patterns Number of patterns is dependent on period of interest a. As from F i g . 5 ( b ) we can see that as the period of interest increases, the number of patterns also increases. This is because as the period of interest increases, the algorithm has more items to process and so they give more number of patterns. Figure 5(b) 0 500 1000 1500 2000 2500 1 2 3 4 5 6 7 8 Timetocalculatethe patterns(msec) Time Vs Period Of Interest time 0 20 40 60 80 100 1 2 3 4 5 6 7 8 NoOfPatterns No Of Patterns Vs POI No Of Patterns Vs POI
  • 12. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.5, October 2012 72 4.3 Comparison between MS-PISA and PISA in terms of time in progressive database Generally mining frequent sequential patterns with multiple minimum supports will take more time when compared with mining frequent sequences with uniform support. In case of percentage of participation (POP), the time for execution of Ms-PISA will take little bit more time than the PISA. The difference of time between these two will be the time for identifying the item having minimum MIS value from the newly arriving element and accumulating the POP. We can observe the difference in the Figure5(c) Figure 5(c) 6. CONCLUSION In the proposed novel work MS PISA algorithm is efficiently working by using the concept of percentage of participation (POP) for mining frequent sequences in progressive databases with multiple minimum supports. Even though the algorithm is taking little more time for execution when compared with PISA (which finds the frequent sequential patterns in progressive databases by considering uniform min.support for different items) it works on frequent sequences with multiple minimum supports for different items. As the novel algorithm is finding the sequences with multiple minimum supports we can ignore the little increase in time of execution. The algorithm MS-PISA is very time efficient in that MS-PISA needs only one scan of the candidate structure mentioned by MMP-Tree at each time stamp rather quadratic scans. In addition with the novel design of MMP-Tree , MS-Pisa takes less space. 0 500 1000 1500 2000 2500 3000 1 2 3 4 5 6 7 8 TimeToCalculatePatterns(msec) Time Variation for Ms-PISA vs PISA Time for MS-PISA Time PISA
  • 13. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.5, October 2012 73 7. REFERENCES [1] R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association Rules,” Proc. 20th Int’l Conf. Very Large Data Bases (VLDB ’94), pp. 478-499, Sept. 1994. [2] R. Agrawal and R. Srikant, “Mining Sequential Patterns,” Proc. 11th Int’l Conf. Data Eng. (ICDE ’95), pp. 3-14, Feb. 1995. [3] R. Srikant and R. Agrawal, “Mining Sequential Patterns: Generalizations and Performance Improvements,” Proc. Fifth Int’l Conf. Extending Database Technology (EDBT ’96), Mar. 1996. [4] B.Liu, W.Hsu and Y.Ma, “Mining association rules with multiple minimum supports”, Proceedings of the fifth ACM, SIGKDD conference Sandiego,CA,USA,August 15-18,1999,P.341 [5] M.J. Zaki, “Efficient Enumeration of Frequent Sequences,” Proc. Seventh ACM Int’l Conf. Information and Knowledge Management (CIKM ’98), Nov. 1998. [6] Jen-Wei Huang, Chi-Yao Tseng, Jian-Chih Ou, and Ming-Syan Chen, Fellow, IEEE: A General Model for Sequential Pattern Mining with a Progressive Database IEEE Transactions On Knowledge and Data Engineering VOL. 20, NO. 9, SEPTEMBER 2008 [7] S Cong, J. Han and D. Padua,“Parallel Mining of Closed Sequential Patterns,” Proc. 11th ACM SIGKDD Int’l Conf. Knowl- edge Discovery and Data Mining (KDD ’05), pp. 562-567, Aug. 2005. [8] J. Wang and J. Han, “Bide: Efficient Mining of Frequent Closed Sequences,” Proc. 20th Int’l Conf. Data Eng. (ICDE ’04), pp. 79-91, 2004. [9] X. Yan, J. Han, and R. Afshar, “Clospan: Mining Closed Sequential Patterns in Large Datasets,” Proc. Third SIAM Int’l Conf. Data Mining (SDM ’03), pp. 166-177, May 2003. [10] M.N. Garofalakis, R. Rastogi, and K. Shim, “Spirit: Sequential Pattern Mining with Regular Expression Constraints,” Proc. 25th Int’l Conf. Very Large Data Bases (VLDB ’99), pp. 223-234, 1999. [11] C. Luo and S.M. Chung, “Efficient Mining of Maximal Sequential Patterns Using Multiple Samples,” Proc. Fifth SIAM Int’l Conf. Data Mining (SDM), 2005. [12] H.Cao, N.Mamoulis, and D.W. Cheung, “Mining Frequent Spatio-Temporal Sequential Patterns,” Proc. Fifth Int’l Conf. Data Mining (ICDM ’05), pp. 82-89, Nov. 2005. [13] J.-K. Guo, B.-J. Ruan, and Y.-Y. Zhu, “A Top-Down Algorithm for Web Log Sequential Pattern Mining,” Proc. Ninth Pacific-Asia Conf. Knowledge Discovery and Data Mining (PAKDD), 2005. [14] C.-C. Ho, H.-F. Li, F.-F. Kuo, and S.-Y. Lee, “Incremental Mining of Sequential Patterns over a Stream Sliding Window,” Proc. IEEE Int’l Workshop Mining Evolving and Streaming Data IWMESD ’06), Dec. 2006
  翻译: