Node Monitoring with Fellowship Model against Black Hole Attacks in MANET
Rutuja Shah, M.Tech (I.T.-Networking), Lakshmi Rani, M.Tech (I.T.-Networking) and S. Sumathy, AP [SG]
Present a Way to Find Frequent Tree Patterns using Inverted Index
Saeid Tajedi and Hasan Naderi
IJCSBI.ORG
Madhu Sharma
Assistant Professor
Department of Computer Science, DIT University
Dehradun, India
ABSTRACT
Recently a lot of research has been done in the field of image encryption using chaotic
maps. In this paper, we propose a new symmetric block cipher algorithm using the 3D
Rossler system. The algorithm utilizes the approaches of Mohamed Amin et al.
[Commun. Nonlinear Sci. Numer. Simulat., 2010] and Vinod Patidar et al. [Commun.
Nonlinear Sci. Numer. Simulat., 2009]. The merits of these algorithms, namely the encryption
structure and the diffusion scheme respectively are combined with an approach to split the
key for the three dimensions to use for encryption of color (RGB) images. The
experimentation results suggest an overall better performance of the algorithm.
Keywords
Image Encryption, Rossler System, Block Cipher, Security Analysis.
1. INTRODUCTION
Image encryption differs somewhat from text encryption: an image is made
up of pixels that are highly correlated, so different approaches are
followed for the encryption of images [1-12]. One of the approaches is known
as chaotic cryptography. In this approach, for encryption we use chaotic
maps, which generate good pseudo-random numbers. Cryptographic
properties of these maps, such as sensitive dependence on initial parameters
and ergodic, random-like behavior, make them ideal for use in designing
secure cryptographic algorithms. Many scholars have proposed various
chaos-based encryption schemes in recent years [4-12].
A scheme proposed by Mohamed Amin et al. [11] uses the Tent map as the
chaotic map and is implemented for gray-scale images. They
proposed a new approach of treating the plaintext as blocks of bits rather than
blocks of pixels. Another scheme, proposed by Vinod Patidar et al. [12], uses
chaotic standard and logistic maps and introduces a way of spreading
the bits using diffusion to avoid redundancy. In this paper, we propose an
algorithm which utilizes the merits of the mentioned schemes. The
algorithm uses the Rossler system for the chaotic key generation. We
demonstrate a way to split the 3 dimensions of the key for the 3 image
channels i.e. Red, Green and Blue. The algorithm in [11] is used as a base
structure and the diffusion concept from [12] is used to spread the effect of
adding the key. The symmetric Feistel structure, diffusion method and key
splitting of the encryption scheme provide better results.
The rest of the paper is organized as follows: Section 2 provides a brief
overview of the Rossler system. Section 3 provides the algorithmic details.
The results of the security analysis are shown in section 4. Lastly, Section 5
concludes the paper.
3. PROPOSED ALGORITHM
In this section we provide details of our algorithm. The algorithm is
designed to work with color images (RGB). In this scheme the plaintext
(image) is taken as blocks of bits. The block size is 8w, where w is the
word size, which is 32 bits. Each block of data is divided and stored in eight
w-bit registers and operations are performed on them. The key length
depends on the number of rounds r, i.e., the key length is 4r + 8. The number of
rounds can vary from 1 to 255. We have taken r to be 12 for our
experimentation.
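As a quick sketch of the sizing rules above (Python; the helper names are ours, for illustration only):

```python
def key_length_bytes(r: int) -> int:
    """Key length for r rounds, per the 4r + 8 rule stated above."""
    if not 1 <= r <= 255:
        raise ValueError("the number of rounds must be between 1 and 255")
    return 4 * r + 8

def split_block(block: bytes) -> list:
    """Split one 8w-bit block (w = 32, i.e. 32 bytes) into eight 32-bit registers."""
    assert len(block) == 32, "a full block is 8 x 32 bits = 32 bytes"
    return [int.from_bytes(block[i:i + 4], "big") for i in range(0, 32, 4)]
```

With r = 12, the value used in the experiments, the key length works out to 56.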
The flowchart shown in Fig. 1 displays the various steps performed on the
image during the encryption process. The steps are explained in the
following subsections.
3.1 Padding
The processing of the image is done on blocks of data: 256 bits, i.e., 32 bytes, of
data are encrypted/decrypted at a time using eight 32-bit registers. The
image size should be a multiple of 256 bits so that there is always a
full block for encryption. Hence padding is added to make the
input block 32 bytes in size when the image size in bytes is not an integral
multiple of 32. A padding of all zeros (1-31 bytes) is appended to the end of
each row to make the number of bytes in each row a multiple of 32.
For example, if the image is of dimensions 252 x 252 pixels, a 4-byte
padding of zeros is appended at the end of each row. The last byte of the
image then stores the number of padding bytes as a pixel value, i.e., 4
in this case. This pixel value is used to remove the padding after decryption.
After retrieving the number of padded bytes n, all rows are checked to
determine whether zeros exist in the last n bytes of each row and in the last
n-1 bytes of the last row. The padding is then removed to recover the original image.
b. Use the decimal part of the X, Y, Z values to generate the key byte.
Xn = abs (Xn - integer part); // decimal part of x
Yn = abs (Yn - integer part); // decimal part of y
Zn = abs (Zn - integer part); // decimal part of z
d. For the next set of key bytes the number of iterations is changed to a
value obtained by performing exclusive-or on the current set of key
bytes.
Iterations for next key byte = XOR (Xn, Yn, Zn);
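Steps (a) and (c) are not reproduced above; the sketch below illustrates only steps (b) and (d). It assumes a simple Euler iteration of the classic Rossler equations with the usual parameters (a = b = 0.2, c = 5.7), and the scaling of the decimal part to a byte is our assumption, since the exact mapping is not specified here:

```python
import math

def rossler_step(x, y, z, a=0.2, b=0.2, c=5.7, dt=0.01):
    """One Euler step of the 3D Rossler system (classic parameters assumed)."""
    return (x + dt * (-y - z),
            y + dt * (x + a * y),
            z + dt * (b + z * (x - c)))

def frac_byte(v):
    """Step (b): take the decimal part of a coordinate and map it to one byte.
    The scaling by 10**6 is an assumed mapping, not taken from the paper."""
    return int(abs(v - math.trunc(v)) * 10**6) % 256

def next_key_bytes(x, y, z, iterations):
    """Iterate the system, emit one key byte per dimension, and derive the
    iteration count for the next set via XOR (step (d))."""
    for _ in range(iterations):
        x, y, z = rossler_step(x, y, z)
    kx, ky, kz = frac_byte(x), frac_byte(y), frac_byte(z)
    return (kx, ky, kz), (x, y, z), kx ^ ky ^ kz
```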
exclusive or of the new second pixel and the third pixel, and so on. Thus the
first pixel of the channel remains unchanged.
Vertical diffusion is performed before and after the entire encryption,
and horizontal diffusion is performed on the three channels of the image. In
vertical diffusion the channels are treated collectively. The processing
occurs from the last pixel of the image to the first pixel. It starts by
performing XOR of the green and blue values of the last pixel of the image
with the red value of the second last pixel to form the new red value of the
second last pixel. The green value of the second last pixel is formed by
performing XOR operation on the red and blue values of the last pixel. The
blue value of the second last pixel is formed by XOR operation on the red
and green values of the last pixel. This continues in the backward direction.
Thus the last pixel remains unchanged.
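The vertical diffusion just described can be sketched as follows (pure Python over a flat list of (R, G, B) tuples). Whether each channel is also XORed with its own previous value is not fully spelled out above; this sketch assumes it is, which keeps the step invertible:

```python
def vertical_diffuse(pixels):
    """Vertical diffusion over a list of (r, g, b) pixels, processed from the
    last pixel backwards; the last pixel remains unchanged."""
    out = [list(p) for p in pixels]
    for i in range(len(pixels) - 2, -1, -1):
        r_next, g_next, b_next = out[i + 1]   # already-updated next pixel
        out[i][0] ^= g_next ^ b_next          # new red:   XOR with next G, B
        out[i][1] ^= r_next ^ b_next          # new green: XOR with next R, B
        out[i][2] ^= r_next ^ g_next          # new blue:  XOR with next R, G
    return [tuple(p) for p in out]
```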
4. EXPERIMENTATION RESULTS
We performed security analysis on six 256 x 256 color (RGB) images, as
shown in Fig. 4. The statistical and differential analysis tests performed
show very favorable results, demonstrating the strength and security
of the algorithm. Results are also given in [14] to demonstrate how the
vulnerability in [11] is overcome.
Figure 4. Plain images (clockwise from top left): Lena, Bridge, Lake, Plane,
Peppers and Mandrill
Figure 5. Left side: histograms of the Lena plain image for the red, green and blue
channels (top to bottom). Right side: histograms of the encrypted Lena image for the
red, green and blue channels (top to bottom).
4.1.2 Correlation of Adjacent Pixels
In a plain image the adjacent pixels show a high degree of correlation in
horizontal, vertical and diagonal directions. The encrypted image should
have a very small degree of correlation among its adjacent pixels. We select
1000 random pairs of pixels from an image and the following formula gives
the correlation coefficient.
r_xy = cov(x, y) / ( sqrt(D(x)) sqrt(D(y)) )    (2)
where,
cov(x, y) = (1/N) Σ_{i=1..N} (x_i - E(x)) (y_i - E(y))    (3)
D(x) = (1/N) Σ_{i=1..N} (x_i - E(x))^2    (4)
E(x) = (1/N) Σ_{i=1..N} x_i    (5)
Here xi and yi form the pair of ith adjacent pixels and N is the total number
of pairs.
Table 1 shows the correlation coefficient values of the six plain images (Fig.
4) between horizontal, vertical and diagonal adjacent pixels. It can be noted
that the adjacent pixels are highly correlated.
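The correlation coefficient of equations (2)-(5) can be computed directly; a minimal Python sketch over N pairs of adjacent pixel values:

```python
def corr_adjacent(pairs):
    """Correlation coefficient r_xy = cov(x, y) / (sqrt(D(x)) * sqrt(D(y)))
    for a list of (x_i, y_i) adjacent-pixel pairs, per Eqs. (2)-(5)."""
    n = len(pairs)
    ex = sum(x for x, _ in pairs) / n          # E(x)
    ey = sum(y for _, y in pairs) / n          # E(y)
    cov = sum((x - ex) * (y - ey) for x, y in pairs) / n
    dx = sum((x - ex) ** 2 for x, _ in pairs) / n
    dy = sum((y - ey) ** 2 for _, y in pairs) / n
    return cov / (dx ** 0.5 * dy ** 0.5)
```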
Table 2 shows the correlation coefficient values for the red, green and blue
channels of the cipher images formed by encrypting the plain images with the
proposed encryption algorithm. The cipher images bear very little
resemblance to the original images, and their adjacent pixels in the
horizontal, vertical and diagonal directions are correlated to a very small
degree.
channel of cipher image; and so on for the green and blue channels of the
plain image. These are represented as CRR, CRG, CRB, CGR, CGG, CGB, CBR,
CBG, CBB; where for any Cij, i represents a channel (R,G,B) of plain image
and j represents a channel (R, G, B) of the cipher image. The coefficient values
given in Table 3 show that there is little or practically no correlation
between a plain image and its corresponding cipher image. The cipher
image thus displays the characteristics of a random image.
Image      CRR      CRG      CRB      CGR      CGG      CGB      CBR      CBG      CBB
Lena      -0.0033   0.0016   0.0047  -0.0026  -0.0008   0.0006  -0.0029   0.0003  -0.0021
Bridge    -0.0029   0.0005   0.0003  -0.0020  -0.0006   0.0011   0.0008   0.0007   0.0010
Lake      -0.0012   0.0002   0.0005  -0.0041  -0.0007   0.0033  -0.0050  -0.0021   0.0039
Mandrill  -0.0019  -0.0004  -0.0024  -0.0035   0.0011  -0.0036  -0.0034   0.0005  -0.0036
Peppers   -0.0030  -0.0059  -0.0022  -0.0033  -0.0024  -0.0012  -0.0042  -0.0007   0.0005
Plane      0.0072   0.0014  -0.0003   0.0068   0.0025   0.0015   0.0057   0.0033   0.0033
D(i, j) = 0, if X1(i, j) = X2(i, j)
D(i, j) = 1, if X1(i, j) ≠ X2(i, j)    (6)
Values for NPCR and UACI are calculated as given in equations (7) and (8),
where W and H denote width and height of the cipher images, T denotes the
largest supported pixel value in the cipher images (255 in our case) and
abs() computes the absolute value. The NPCR and UACI values given in
Table 4 show that the encryption algorithm is secure against differential
attacks.
NPCR = ( Σ_{i,j} D(i, j) / (W x H) ) x 100%    (7)

UACI = ( (1/(W x H)) Σ_{i,j} |X1(i, j) - X2(i, j)| / T ) x 100%    (8)
Table 4. NPCR and UACI Values Obtained for Encryption of 6 Plain Images and the Same
Images with 1 Pixel Changed
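A minimal sketch of the NPCR and UACI computations of equations (7) and (8) (Python; plain nested lists stand in for single-channel cipher images):

```python
def npcr_uaci(c1, c2, T=255):
    """NPCR and UACI (both in percent) for two equal-sized cipher images,
    given as 2-D lists of pixel values; T is the largest pixel value."""
    H, W = len(c1), len(c1[0])
    # NPCR: fraction of positions where the two cipher images differ.
    diff = sum(1 for i in range(H) for j in range(W) if c1[i][j] != c2[i][j])
    npcr = diff / (W * H) * 100
    # UACI: mean absolute intensity difference, normalized by T.
    uaci = sum(abs(c1[i][j] - c2[i][j]) / T
               for i in range(H) for j in range(W)) / (W * H) * 100
    return npcr, uaci
```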
5. CONCLUSION
In this paper we proposed a new image encryption algorithm. The merits of
recent research, selected on the basis of reported results, were combined with a
symmetric approach to encryption to provide a secure algorithm. The diffusion
mechanism along with the Feistel structure makes the algorithm stronger. The
3D Rossler system of equations is used for random key generation. The
splitting of the three dimensions of the key across the three channels makes
cryptanalysis to obtain the key more difficult. The experiments
performed show that the algorithm generates favorable results.
REFERENCES
[1] Chang, C.-C., Hwang, M.-S. and Chen, T.-S., 2001. A New Encryption Algorithm for
Image Cryptosystems. Journal of Systems and Software, Vol. 58, No. 2, pp. 83-91.
[2] Yano, K. and Tanaka, K., 2002. Image Encryption Scheme Based on a Truncated
Baker Transformation. IEICE Transactions on Fundamentals of Electronics,
Communications and Computer Sciences, Vol. E85-A, No. 9, pp. 2025-2035.
[3] Gao, T. and Chen, Z., 2008. Image Encryption Based on a New Total Shuffling
Algorithm. Chaos, Solitons and Fractals, Vol. 38, No. 1, pp. 213-220.
[4] Chen, G., Mao, Y. and Chui, C.K., 2004. A Symmetric Image Encryption Based on 3D
Chaotic Cat Maps. Chaos, Solitons and Fractals, Vol. 21, pp. 749-761.
[5] Mao, Y., Chen, G. and Lian, S., 2004. A Novel Fast Image Encryption Scheme Based
on 3D Chaotic Baker Maps. International Journal of Bifurcation and Chaos, Vol. 14,
No. 10, pp. 3613-3624.
[6] Guan, Z.-H., Huang, F. and Guan, W., 2005. Chaos Based Image Encryption
Algorithm. Physics Letters A, Vol. 346, pp. 153-157.
[7] Zhang, L., Liao, X. and Wang, X., 2005. An Image Encryption Approach Based on
Chaotic Maps. Chaos, Solitons and Fractals, Vol. 24, pp. 759-765.
[8] Gao, H., Zhang, Y., Liag, S. and Li, D., 2006. A New Chaotic Algorithm for Image
Encryption. Chaos, Solitons and Fractals, Vol. 29, pp. 393-399.
[9] Pareek, N.K., Patidar, V. and Sud, K.K., 2006. Image Encryption Using Chaotic
Logistic Map. Image and Vision Computing, Vol. 24, pp. 926-934.
[10] Wong, K.-W., Kwok, B.S.-H. and Law, W.-S., 2008. A Fast Image Encryption Scheme
Based on Chaotic Standard Map. Physics Letters A, Vol. 372, pp. 2645-2652.
[11] Amin, M., Faragallah, O.S. and Abd El-Latif, A.A., 2010. A Chaotic Block Cipher
Algorithm for Image Cryptosystems. Communications in Nonlinear Science and
Numerical Simulation, Vol. 15, pp. 3484-3497.
[12] Patidar, V., Pareek, N.K. and Sud, K.K., 2009. A New Substitution-Diffusion Based
Image Cipher Using Chaotic Standard and Logistic Maps. Communications in
Nonlinear Science and Numerical Simulation, Vol. 14, pp. 3056-3075.
[13] Rossler, O.E., 1976. An Equation for Continuous Chaos. Physics Letters A, Vol. 57,
No. 5, pp. 397-398.
[14] Kamat, V.G. and Sharma, M., 2014. Enhanced Chaotic Block Cipher Algorithm for
Image Cryptosystems. International Journal of Computer Science Engineering, Vol. 3,
No. 2, pp. 117-124.
S. Sumathy, AP [SG]
School of Information Technology & Engineering, VIT University
Abstract
Security issues have increased considerably in mobile ad-hoc networks. Due to the absence of any
centralized controller, the detection of problems and recovery from such issues are difficult. Packet
drop attacks are among the attacks that degrade network performance. In this paper, we
propose an effective node monitoring mechanism with fellowship model against packet drop attacks
by setting up an observance zone where suspected nodes are observed for their performance and
behavior. Threshold limits are set to monitor the equivalence ratio of the number of packets
received at a node to the number transmitted by that node inside the mobile ad hoc network. This fellowship model enforces a
binding on the nodes to deliver essential services in order to receive services from neighboring nodes
thus improving the overall network performance.
Keywords: Black-hole attack, equivalence ratio, fair-chance scheme, observance zone, fellowship
model.
1. INTRODUCTION
Mobile ad-hoc networks are infrastructure-less, self-organized and self-configured
networks of mobile devices connected by radio signals. There is no centralized
controller for networking activities such as monitoring, modification and updating
of the nodes inside the network, as shown in figure 1. Each node is free to
move in any direction and hence has the freedom to change its links to other nodes
frequently. There have been serious security threats in MANETs in recent years.
These usually lead to performance degradation, reduced throughput, congestion, delayed
response times, buffer overflows, etc. Among them is a well-known attack on packets,
the black-hole attack, which is a form of DoS (Denial of Service) attack. In
this, a router relays packets to different nodes, but due to the presence of malicious nodes
these packets are susceptible to packet drop attacks. This hinders
secure and reliable communication inside the network.
Section 2 addresses the seriousness of packet drop attacks and related work done so
far in this area. Section 3 elaborates our proposal and defending scheme for packet
drop attacks. Section 4 provides concluding remarks.
2. LITERATURE SURVEY
Packet drop loss in ad-hoc networks gained importance because of self-serving
nodes which fail to provide the basic facility of forwarding packets to
neighboring nodes. This hampers the functionality of the
network. Generally there are two types of misbehaving nodes: selfish and malicious. Selfish
nodes act only to enhance their own performance, while
malicious nodes degrade the functioning of the network through their
continual activity. WATCHERS [1], from UC Davis, was presented to detect and
remove routers that maliciously drop or misroute packets. WATCHERS was based
on the principle of conservation of packet flow, but it could not differentiate well
between malicious and genuine nodes. Although it was robust against Byzantine
faults, it is not very effective in today's Internet at reducing packet loss.
The basic mechanism of packet drop loss is that nodes do not forward
packets to other nodes, whether selfishly or maliciously. Packet drop loss can occur due to
a black hole attack. Sometimes routers behave maliciously, i.e., they do not
forward packets; such attacks are known as grey hole attacks. In the case of
routers the attacks can be traced quickly, while in the case of nodes it is a
cumbersome task. Many researchers have worked in this field and have tried to find
solutions to this attack [2-6]. Energy level is one of the parameters on which
researchers have reported results. This idea works on the basis of the ratio of the
fraction of energy committed by a node to the overall energy contributed towards the
network. A node is retained inside the network on the basis of its energy level, and the
energy level is decided by the activeness of the node in the network through mathematical
computations. These computations are [7] too complicated to grasp, and
sometimes the results are catastrophic. The computations may be
accurate, but they are very prone to ambiguity in the case of ad-hoc networks.
A few techniques involve using routing table information, which is modified after
detecting the MAC address of a malicious node that uses jamming-style DoS attacks, in
order to cease its activities [8]. Another approach to reducing attacks uses a
historical-evidence trust management strategy [9]: a direct trust value (DTV) is maintained
amongst neighboring nodes to monitor the behavior of nodes, depending on their past,
against black hole attacks. However, there is a high possibility that trust values may
be compromised by malicious nodes, and the third party used for setting the
trust values is also vulnerable to attacks. Recent methods include the [10]
introduction of a new protocol called RAEED (Robust formally Analyzed protocol
for wirEless sEnsor networks Deployment), which reduces this attack but not by a
considerable percentage. To overcome the issues faced in implementing these
strategies, there is a need for an effective mechanism to curb these attacks and make the
network more secure.
3. PROPOSED APPROACH
In this paper, we put forth a mechanism to reduce packet-drop attacks by
implementing node monitoring with a fellowship technique. We introduce an
obligation on the nodes inside a particular network to render services to the network. If
services are not rendered, the node is expelled from the network.
However, we keep a fair-chance scheme for all nodes, which helps to determine
whether a node is genuine or malicious.
zone, the suspected node is given fair-chance treatment. That is, during the
observance zone period, the suspected node is required to submit its status-message
to neighboring nodes to prove the genuineness of its performance inside the
network. Genuine nodes will promptly provide their status-messages to
neighboring nodes because they will be willing to stay inside the network and render
services under obligation to the network. Malicious nodes, however, may or may
not send their status-messages to neighboring nodes, since their aim is to degrade the
performance of the network. For such status-messages, only a fair chance is given.
That is, a standard threshold level is set up unanimously amongst the neighboring
nodes inside the network, and status-messages are entertained only up to the threshold level.
So even if malicious nodes produce fake status-messages to
neighboring nodes in order to remain inside the network, the threshold limits ensure they
cannot degrade network performance much. When the threshold is crossed, the neighboring
nodes are intimated about the node under the observance zone, and a
unanimous decision is taken to expel that suspected node from the network.
Under this scheme, a suspected node may be expelled from
the network in two circumstances: it is either a genuine node that is
underperforming, or a malicious node. In both cases the suspected node needs to be
expelled from the network because it is degrading the performance of the
network. The fair-chance scheme ensures that genuine nodes are given a fair chance
to justify themselves and to repair themselves soon, proving their genuineness to render
services to the network under obligation.
3.2 Scenario Assumptions
Let the nodes inside the MANET be connected to each other through wireless links, and
let packets be transmitted and received between the nodes. Let the
nodes be named alphabetically A, B, C and so on up to Z. Let node X be a
malicious node which drops packets (mounting a black hole attack) and hence has a
poor equivalence ratio, while node Y is a genuine node that also has a poor equivalence
ratio due to network congestion or some other network issue. All
nodes inside the network follow the principle of node monitoring with fellowship.
The data structures used are the networking parameters, which are as follows:
1) equi_ratio = the equivalence_ratio of a node
Steps involved:
Step 1: All nodes calculate their own equivalence ratio (equi_ratio) and share it with
their neighboring nodes (taken to be at one-hop distance) periodically.
Step 2: All nodes unanimously agree upon a standard threshold level (in this case,
threshold_value=3) through exchange of messages using agreement protocols.
Step 3: All nodes monitor their neighbors' equi_ratio, and if any node has an
equi_ratio which is quite poor, then that particular node is placed on the
observance zone list through mutual exchange of messages among the nodes inside the
network. Such nodes may be malicious nodes, or genuine nodes
with poor performance.
Step 4: Once the suspected node is kept in observance zone list, it is made
mandatory for that node to report the status_message to the neighboring nodes to
justify their performance and behavior.
Step 5: If it is a malicious node (node X), it may either fake its status_message to
feign genuineness and stay inside the network, or it may simply avoid sending its
status_message, since it wishes to continue its malicious activities. If it
is a genuine node (node Y), it will send its status_message in order to prove its
genuineness and will try to improve its performance by recovering from the network
issues it faces while sending packets. However, in both cases we
limit the frequency of justification through status_message using the fair-chance
scheme, wherein nodes are allowed to justify themselves only up to a certain
threshold_value (here, value = 3, i.e., a suspected node is allowed to
send its status_message only 3 times to justify its performance). In short, both malicious
nodes and underperforming genuine nodes are kept under
surveillance to observe their behavior.
Step 6: Thus, nodes which cross the threshold_value limit are
immediately expelled from the network through the exchange of protocol
messages between the neighboring nodes. In this way, packet-drop attacks can be
considerably reduced. Figure 2 explains the workflow mechanism.
Figure 2. Workflow: set the threshold_value unanimously and exchange equi_ratio with
neighboring nodes periodically; if a node's equi_ratio is unacceptable, exchange
status_message and check it against the threshold_value; above the threshold_value the
node is expelled.
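Steps 1-6 above can be sketched with a small illustrative model (Python; this is not the authors' implementation, and the "quite poor" equi_ratio cut-off is an assumed value):

```python
class FellowshipMonitor:
    """Illustrative model of the observance-zone / fair-chance scheme."""

    def __init__(self, threshold_value=3, ratio_floor=0.5):
        self.threshold = threshold_value   # Step 2: agreed unanimously
        self.floor = ratio_floor           # assumed 'quite poor' cut-off
        self.observance = {}               # node -> status_messages used
        self.expelled = set()

    def report_ratio(self, node, equi_ratio):
        """Step 3: a poor equi_ratio puts the node on the observance zone list."""
        if equi_ratio < self.floor and node not in self.expelled:
            self.observance.setdefault(node, 0)

    def status_message(self, node):
        """Steps 4-6: each status_message uses one fair chance; past the
        threshold the node is expelled from the network."""
        if node not in self.observance:
            return "not under observance"
        self.observance[node] += 1
        if self.observance[node] > self.threshold:
            del self.observance[node]
            self.expelled.add(node)
            return "expelled"
        return "observed"
```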
3.3 Advantages:
1. Fair chance scheme ensures genuineness of innocent nodes.
2. No complex mathematical computations of energy levels at each node.
3. Periodical reporting ensures removal of both underperforming and malicious
nodes from the network.
4. Improvement of network performance in the MANET.
3.4 Disadvantages
However, there is an overhead of exchanging a larger number of messages among the
neighboring nodes. Optimizing the number of messages exchanged during
communication can be addressed in future research.
4. CONCLUSION
In this paper, we have proposed a novel scheme to reduce packet drop attacks and
enhance network performance. We anticipate that our node-monitoring-with-fellowship
model may increase the number of messages exchanged
amongst neighboring nodes during the agreement protocols inside the network, but at the
same time it is robust against attacks and thus increases the availability of nodes
in mobile ad-hoc networks. Minimizing packet drop loss yields better
utilization of the channel and resources and guaranteed QoS, which results in productive
priority management and considerably controlled traffic through periodic surveillance of
the nodes. Future research would aim to reduce the exchange of messages
amongst the nodes, minimize the overhead and achieve optimization inside mobile
ad-hoc networks.
5. REFERENCES
[1] K. A. Bradley, S. Cheung, N. Puketza, B. Mukherjee and R. A. Olsson, Detecting Disruptive
Routers: A Distributed Network Monitoring Approach, in the 1998 IEEE Symposium on Security and
Privacy, May 1998.
[2] Y.C. Hu, A. Perrig and D. B. Johnson, Ariadne: A Secure On-demand Routing Protocol for Ad
Hoc Networks, presented at International Conference on Mobile Computing and Networking, Atlanta,
Georgia, USA, pp. 12 - 23, 2002.
[3] P. Papadimitratos and Z. J. Haas, Secure Routing for Mobile Ad hoc Networks, presented at SCS
Communication Networks and Distributed Systems Modeling and Simulation Conference, San
Antonio, TX, January 2002.
[4] K. Sanzgiri, B. Dahill, B. N. Levine, C. Shields and E. M. Belding-Royer, A Secure Routing
Protocol for Ad Hoc Networks, presented at 10th IEEE International Conference on Network
Protocols (ICNP'02), Paris, pp. 78 - 89, 2002.
[5] V. Balakrishnan and V. Varadharajan, Designing Secure Wireless Mobile Ad hoc Networks,
presented at Proceedings of the 19th IEEE International Conference on advanced information
Networking and Applications (AINA 2005). Taiwan, pp. 5-8, March 2005.
[6] V. Balakrishnan and V. Varadharajan, Packet Drop Attack: A Serious Threat to Operational
Mobile Ad hoc Networks, presented at Proceedings of the International Conference on Networks and
Communication Systems (NCS 2005), Krabi, pp. 89-95, April 2005.
[7] V. Balakrishnan and V. Varadharajan, Short Paper: Fellowship in Mobile Ad hoc
Networks, presented at Proceedings of the First International Conference on Security and Privacy for
Emerging Areas in Communications Networks (SECURECOMM'05), IEEE.
[8] Raza, M., and Hyder, S.I. A forced routing information modification model for preventing black
hole attacks in wireless Ad Hoc network presented at Applied Sciences and Technology (IBCAST),
2012, 9th International Bhurban Conference, Islamabad, pp. 418-422, January 2012.
[9] Bo Yang , Yamamoto, R., Tanaka, Y. Historical evidence based trust management strategy
against black hole attacks in MANET published in 14th International Advanced Communication
Technology (ICACT), 2012, pp. 394-399.
[10] Saghar, K., Kendall, D. and Bouridane, A. Application of formal modeling to detect black hole
attacks in wireless sensor network routing protocols. Applied Sciences and Technology (IBCAST),
2014, 11th International Bhurban Conference, Islamabad, pp. 191-194, January 2014.
Shah, R., Rani, L. and Sumathy, S. 2014. Node Monitoring with Fellowship Model
against Black Hole Attacks in MANET. International Journal of Computer Science
and Business Informatics, Vol. 14, No. 1, pp. 14-21.
Sagayaraj Francis
Department of Computer Science and Engineering,
Pondicherry Engineering College, India
ABSTRACT
When an e-Learning system is installed on a server, numerous learners make use of it and
download various learning objects from the server. Most of the time, the request is for the
same learning object, which is downloaded from the server again, resulting in the server
performing the same repetitive task of locating the file and sending it across to the
requestor (the client). This wastes the precious CPU time of the server on a task that
has already been performed. This paper provides a novel structure and an algorithm that
store the details of the various clients who have already downloaded a learning object
in a dynamic hash table, look up that table when a new request comes in, and send the
learning object from such a client to the requestor, thus saving the precious CPU time of the
server by harnessing the computing power of the clients.
Keywords
Learning Objects, e-Learning, Load Distribution, Load Balancing, Data Structure,
Peer-to-Peer Distribution.
1. INTRODUCTION
1.1 e-Learning
Education is defined as the conscious attempt to promote learning in others
to acquire knowledge, skills and character [1]. To achieve this mission,
different pedagogies were used; later, with the advent of new
information and communication technology tools and the popularity gained by the
internet, these were used to enhance the teaching-learning process and gave way to
the birth of e-learning [2]. This enabled learners to learn across
time and geographical barriers and allowed them to follow individualized
learning paths [3]. The perception of e-Learning, or electronic learning, is
that it is a combination of the internet, electronic form and networks to
disseminate knowledge. The key factors of e-learning are reusing, sharing
resources and interoperability [4]. At present there are various organizations
providing e-learning tools of multiple functionalities, one of which is
MOODLE (Modular Object Oriented Dynamic Learning Environment) [5],
which is used on our campus. This in turn created difficulty in sharing
learning objects between heterogeneous sites, and standards such as SCORM
& SCORM LOM [6], IMS & IMS DRI [7], AICC [8] and the like were
proposed by different organizations. In Berners-Lee's famous architecture for the
Semantic Web, ontologies are used for sharing and interoperability, and these
can be used to build better e-learning systems [9]. To define
components for e-learning systems, the methodology used is the principle of
composability in Service-Oriented Architecture [10], since it enables us to
define the inter-relations between the different e-learning components. The
most popular model used nowadays in the teaching-learning process is the
Felder-Silverman learning style model [11]. The e-Learning components are
based on key topics, topic types, associations and occurrences. A VLE
(Virtual Learning Environment) is the software which handles all the
activities of learning. Learning objects are the learning materials which
promote visual, verbal, logical and musical intelligence [12] through
presentations, tutorials, problem solving and projects. Through multimedia,
gaming and simulation, kinaesthetic intelligence is promoted. Interpersonal,
intrapersonal and naturalistic intelligence are promoted by means of chat,
SMS, e-mail, forums, video and audio conferencing, surveys, voting and
search. Finally, assessment is used to test the knowledge acquired by the
learner, and the repository is the place which holds all the learning
materials.
This algorithm is useful when learners access the learning objects
stored in the repository. It reduces the server's load by
directing a client to respond to the requestor with a file it has already
downloaded from the server.
the computing power of the computers in the network for high-performance
computing and scientific applications, faster access to data and reduced
computing time is still to be explored. In a P2P network the data is
de-clustered across the peers in the network. When there is a requirement for
popular data from across the peers, a bottleneck occurs,
degrading the system response. To handle this, a new strategy using a
new structure and an algorithm is proposed in this paper.
3. Divide the sum by the length of the file name, so 12/3 = 4, which becomes
the index for the file in the DHT. The above three steps are mathematically
formulated in equation 1.
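The three steps can be sketched as follows (Python; integer division in the final step is an assumption that matches the 12/3 = 4 worked example):

```python
def hashed(filename: str) -> int:
    """Index for a file name: for each character, add its 1-based position in
    the name to its 1-based position in the alphabet, then divide the sum by
    the length of the name."""
    total = sum(i + (ord(c.lower()) - ord('a') + 1)
                for i, c in enumerate(filename, start=1))
    return total // len(filename)

# For a name such as "abc": (1+1) + (2+2) + (3+3) = 12, and 12 // 3 = 4.
```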
Figure 1. Proposed Data Structure: an Address Table pointing into the hash table; each
index holds a linked list of file-name nodes, and each node points to a binary tree whose
entries store a client's IP, CPU usage and time.
Every index of the DHT holds the starting address of a linked list in which
every node stores a file name that has already been downloaded. The linked
list structure is used to avoid index collisions between file names
generating the same index in the DHT: a collision is resolved by creating a
new node in the linked list for the new file name. As shown in Figure 1,
every node in the linked list holds three values, namely the file name, the
address of a binary tree, and the address of the next node in the list. The
nodes of the binary tree hold the active clients' IPs and their current CPU
usage. The binary tree is used to identify the client with the least-used
CPU to transfer the file to the requestor; this harnesses the computing
power of the least-used CPU. The binary tree structure is used to reduce the
search time for the least-used client. If the file has not been downloaded
by any client, i.e. when the last node of the linked list is reached, then
the file is transferred from the server.
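A minimal Python sketch of this structure, with a list of chains standing in for the DHT, dictionaries standing in for the linked-list nodes, and a flat map of client IPs to CPU usage standing in for the binary tree (all names are illustrative, not the paper's implementation):

```python
class FileCluster:
    """Sketch of Figure 1: hash slots chain (file name, clients) entries;
    each entry tracks the active clients that already hold the file,
    together with their current CPU usage."""

    def __init__(self, size=16):
        # Chaining resolves index collisions between file names.
        self.slots = [[] for _ in range(size)]

    def _index(self, name):
        # Stand-in hash; the paper uses its own HASHED() function.
        return sum(ord(c) for c in name) % len(self.slots)

    def record_download(self, name, ip, cpu_usage):
        chain = self.slots[self._index(name)]
        for entry in chain:
            if entry["file"] == name:
                entry["clients"][ip] = cpu_usage
                return
        chain.append({"file": name, "clients": {ip: cpu_usage}})

    def least_used_client(self, name):
        """IP with the lowest CPU usage, or None -> serve from the server."""
        for entry in self.slots[self._index(name)]:
            if entry["file"] == name:
                return min(entry["clients"], key=entry["clients"].get)
        return None

cluster = FileCluster()
cluster.record_download("notes.pdf", "10.0.0.1", 75)
cluster.record_download("notes.pdf", "10.0.0.2", 20)
print(cluster.least_used_client("notes.pdf"))  # 10.0.0.2 (least-used CPU)
```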
Algorithm 1 SEND ( ) {
  Request directed to the File Cluster
  Address of the File Cluster taken from the Address Table
  Index Location of the File = HASHED (File Name)
  If (the index is out of bounds)
  {
    The file has not been downloaded by any client
    It is sent from the server to the client
  }
  Else
  {
    While (not end of Linked List AND the node is not found)
    {
      If (Node.data == File Name)
      {
        Node found = true
        While (not end of Binary Tree)
        {
          Least-used-CPU IP = LEASTUSEDCPU ( )
        }
      }
    }
    If (Node found == true)
      Send the requested file from that IP to the Requestor
    Else
    {
      The file has not been downloaded by any client
      It is sent from the server to the client
    }
  }
}
End of SEND ( )
Algorithm 2 HASHED (File Name) {
  Len = StringLength (File Name)
  While (not end of String)
  {
    IndexString += (Position of the Character in the File Name
                    + Position of the Character in the alphabet list)
  }
  IndexInt = ConvertToInteger (IndexString)
  return (IndexInt / Length of the Array)
}
End of HASHED ( )
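The decision logic of SEND ( ) can be sketched as follows; the `chain` argument plays the role of the linked list for the file's DHT slot, and a plain dictionary of IPs to CPU usage stands in for the binary tree (an illustrative simplification, not the paper's implementation):

```python
def send(file_name, chain, server_ip="server"):
    """Sketch of SEND(): walk the linked-list chain for the file's DHT slot;
    if a node matches, serve from the client with the least CPU usage
    (the LEASTUSEDCPU() lookup); otherwise the file has not been
    downloaded by any client and the server sends it."""
    for node in chain:
        if node["file"] == file_name:
            clients = node["clients"]             # {ip: cpu_usage}
            return min(clients, key=clients.get)  # least-used CPU wins
    return server_ip

chain = [{"file": "a.txt", "clients": {"1.1.1.1": 90, "2.2.2.2": 10}}]
print(send("a.txt", chain))  # 2.2.2.2
print(send("b.txt", chain))  # server
```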
3. MATHEMATICAL FORMULATION
index: search the linked-list nodes Z_K, K = 1, ..., n, for Z_K(f3) = X(f2); if no match, go to (6)   (3)
Y = min over m = 1, ..., n of C(m, m+1)   (4)
f2 = X <- Y(f1)   (5)
f2 = X <- S(f1)   (6)
where,
index is the index in the Dynamic Hash Table
l is the length of the file name
i is the character position in the file name
j ∈ { 1, 2, 3, 4, 5, …, 26 }
k is the number of nodes in the linked list
Z is the node in the linked list
f3 is the file name in the node in the linked list
f2 is the targeted file
m is the nodes in the Binary Tree
Y is the node in the Binary Tree with the minimum C
C is the CPU usage time of the specified IP
S is the Server
4. CONCLUSION
The main advantage of this architecture is that server time is saved by
harnessing the computational power of the clients who have already
downloaded the file to send it across to the requestor. Another advantage of
the architecture is the file search, which has been accelerated by the
Dynamic Hash Table and binary tree structures. This algorithm is currently
being implemented in PHP, and its results will be published in future work.
Initial results indicate a substantial reduction in the server's CPU
processing time when this algorithm is executed on the server.
REFERENCES
[1] Lavanya Rajendran, Ramachandran Veilumuthu., 2011. A Cost Effective Cloud
Service for E-Learning Video on Demand, European Journal of Scientific Research,
pp.569-579.
[2] Maria Dominic, Sagyaraj Francis, Philomenraj., 2013. A Study On Users On Moodle
Through Sarasin Model, International Journal Of Computer Engineering And
Technology, Volume 4, Issue 1, pp 71-79.
[3] Maria Dominic, Sagyaraj Francis., 2013. Assessment Of Popular E-Learning Systems
Via Felder-Silverman Model And A Comprehensive E-Learning System,
International Journal Of Modern Education And Computer Science, Hong Kong,
Volume 5, Issue 11, pp 1-10.
[4] Zhang Guoli, Liu Wanjun, 2010. The Applied Research of Cloud Computing. Platform
Architecture in the E-Learning Area, IEEE.
[5] www.moodle.org
[6] SCORM(Sharable Courseware Object Reference Model), http://www.adlnet.org
[7] IMS Global Learning Consortium, Inc., Instructional Management System (IMS),
http://www.imsglobal.org.
[8] http://www.aicc.org
[9] Uschold, Gruninger., 1996. Ontologies: Principles, Methods and Applications,
Knowledge Engineering Review, Volume 11, Issue 2.
[10] Papazoglou, Heuvel., 2007. Service Oriented Architectures: Approaches,
Technologies, and research issues, The VLDB Journal, , Volume 16, Issue 3, pp. 389-
415.
[11] Graf, Viola, Kinshuk., 2006. Representative Characteristics of Felder-Silverman
Learning Styles: An Empirical Model, IADIS, pp. 235-242.
[12] Lorna Uden, Ernesto Damiani., 2007. The Future of E-Learning: E-Learning
ecosystem, Proceeding of IEEE Conference on Digital ecosystems and Techniques,
Australia, pp. 113-117.
[13] Maria Dominic, Sagyaraj Francis., 2012. Mapping E-Learning System To Cloud
Computing, International Journal Of Engineering Research And Technology, India,
Volume1, Issue 6.
[14] Chyouhwa Chen, Kun-Cheng Tsai., 2008. The Server Reassignment Problem for Load
Balancing In Structured P2P Systems, IEEE Transactions On Parallel And Distributed
Systems, Volume 19, Issue 2.
[15] A. Rao, K. Lakshminarayanan, S. Surana, R. Karp, and I. Stoica., 2006. Load
Balancing in Structured P2P Systems. Proc. Second Intl Workshop Peer-to-Peer
Systems (IPTPS 03).
ABSTRACT
This paper determines the degree of information sharing in government institutions through e-
transparent tools. First the basis for the study is set through the background, problem statement
and objectives. The discussion then proceeds by focusing on ICT tools for information sharing.
An information sharing model is proposed and the extent of information sharing in the public
sector of Tanzania through online media is discussed; furthermore, the correlation that exists
between the extent of information sharing and factors such as accessibility, understandability,
usability and reliability is established. The paper concludes by providing recommendations on
information sharing and how it can be enhanced through e-transparency systems for public
service delivery in an open society.
Keywords
E-transparency, E-Governance, Information Sharing, Public Sector, ICT.
low transparency and bureaucracy (Im & Jung, 2001); as a result, this method
allows for the subversion of accountability (Lubua, 2014).
The appropriate use of e-transparency tools is perhaps the best strategy for
an organisation to enhance information sharing with its stakeholders. The
organisation has to emphasise good qualities of information sharing such as
timely response, accessibility of systems, reliability of data, online
security, completeness of online procedures, and openness in service
processes. This paper discusses several issues, including the need for
online information sharing in the public sector and the extent to which
government institutions apply online media for information sharing and
service provision. The study is based on opinions from clients who are
consumers of such services.
2. PROBLEM STATEMENT
Business competition compels organisations to invest in information systems to
improve the efficiency of their operations (Barua, Ravindran, & Whinston, 2007).
This investment is made possible through the knowledge of employees, suppliers,
customers, and other key stakeholders. In this regard the organization that shares
its information with stakeholders more efficiently earns a competitive advantage
(Drake, Steckler, & Koch, 2004).
The government of Tanzania acknowledges the importance of ICTs in promoting
information sharing in society. It uses methods such as conferences,
workshops and public portals to show its intention of maximising information
sharing. With the growth in the number of ICT users, the degree of
information sharing is expected to increase. Therefore, this study intends
to establish the extent to which the use of ICTs has enhanced information
sharing. Further, the study will also establish the correlation between the
extent of information sharing and factors which negatively influence users'
perceptions.
3. OBJECTIVES
This study is designed to cover the following objectives;
i. To determine the extent of information sharing through e-transparency in
the Tanzanian public sector.
ii. To establish the extent to which information usefulness,
understandability, reliability and accessibility influences information
sharing through e-transparency systems.
4. METHODOLOGY
This study was conducted through a mixed research method. First, the study
reviewed the literature to establish its relevance. Then, the Tanzania
Revenue Authority's Customs Online System was identified as the case for
study, followed by survey procedures. Data were collected from twenty (20)
clearing and forwarding companies that operate under the Customs regulations
of the Tanzania Revenue Authority. The study received and analysed a total
of 40 responses. The study collected data from original sources to enhance
validity and relevance. The analytical models used include Spearman's
correlation and regression analysis.
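As an illustration of the analysis step, Spearman's correlation is simply the Pearson correlation of rank-transformed data. The sketch below is self-contained; the Likert-style scores are synthetic and purely illustrative, not the study's data:

```python
def ranks(xs):
    """Average ranks (1-based), with ties sharing the mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean rank for the tied group
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# Synthetic Likert-style scores (1-5) -- illustrative only.
accessibility = [4, 3, 5, 2, 4, 5, 3, 1]
info_sharing  = [4, 2, 5, 2, 3, 5, 3, 1]
print(round(spearman(accessibility, info_sharing), 2))
```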
[Figure: information-seeking model — the information user's need leads to a
source, which may or may not hold the information, followed by action]
The choice of information is dictated by the gap which has to be covered.
When this gap is expressed, it becomes a need. In responding to the need,
the user of information consults a source, which is either electronic or
physical. It is possible that the source may not have the type of
information requested, or that the information may not be satisfying.
Regardless of the status of satisfaction, the user of information takes
action towards covering the gap. Where the public seeks information from
government institutions, dissatisfaction may influence members of the public
to take action, even against the government; on the other hand, satisfaction
builds further support for the government (Lubua, 2014).
The satisfied user of the information applies it to solve the problem
identified in the gap. A good example is a farmer searching for a good
market for his/her harvest; s/he will eventually use the information to
choose a better market.
Together with the progress made in information sharing, there is a need to
know the extent to which government institutions apply online media for
information sharing. This study is based on opinions from clients who are
consumers of online information from a government institution.
Based on the responses from clients of the Tanzania Revenue Authority, it
was found that 70% of respondents agree that the Tanzania Revenue Authority
sufficiently shares its information through online media. These respondents
are clients of Customs services who benefit from the Customs Online System
(CULAS). The following factors influenced the successful deployment of this
system:
influenced by these factors, and a linear regression model is used to
demonstrate the relationship between these variables, as shown in Table 1
below.
¹ Who are clearing and forwarding experts.
of the information to be entered to ensure consistency; further, it gives
the user the opportunity to proofread their data entry before the
information is finally submitted.
9. CONCLUSION
The purpose of the study was to establish the degree to which the Tanzanian
public sector uses ICTs to enhance transparency. The assessment was guided
by the fact that Tanzania advocates good governance, of which information
sharing is an important component. The study also recognises that ICTs play
an important role in the business sector in ensuring that clients access
services efficiently and with maximum transparency. The same business
experience could be adopted by the government to raise citizens'
satisfaction with government services. The study observed that many people
are aware of the importance of ICTs in ensuring transparency in government
operations. However, there are several cases where the performance of
government operations did not meet users' expectations. Factors such as low
reliability of the system and ineffectiveness of the officials operating it
were among those which affected the use of ICTs for enhanced transparent
services. While training was identified as important in equipping users with
the required technical skills, it was occasionally observed to be otherwise;
training requires follow-ups to ensure that it meets its expected goals.
Equally, information accessibility, reliability, usefulness and user
understandability have a great impact on users' experience of online media.
REFERENCES
[1] Badillo-Amador, L., García-Sánchez, A., & Vila, L. E. (2005). Mismatches In The Spanish
Labor Market: Education Vs. Competence Match. International Advances in Economic
Research, Vol 11, 93-109.
[2] Barua, A., Ravindran, S., & Whinston, A. (2007). Enabling Information Sharing Within
Organizations. Information Technology and Management, Vol (3), 31 - 45 .
[3] Cohen, J. (2012). Benefits Of On Job Training. Retrieved February 7, 2013, from
http://jobs.lovetoknow.com
[4] Drake, D., Steckler, N., & Koch, M. (2004). Information Sharing In And Across Government
Agencies: The Role And Influence Of Scientist, Politician, And Bureaucratic Subcultures.
Social Science Computer Research, 22(1), 67-84.
[5] HAKIELIMU, LHRC, REPOA. (2005). Access To Information In Tanzania: Is Still A
Challenge. Retrieved September 11, 2012, from
http://www.tanzaniagateway.org/docs/Tanzania_Information_Access_Challenge.pdf
[6] Hatala, J.-P., & Lutta, J. (2009). Managing Information Sharing Within an Organisation
Settings: A social Network Perspective. Retrieved September 13, 2012, from
http://www.performancexpress.org/wp-content/uploads/2011/11/Managing-Information-
Sharing.pdf
[7] Im, B., & Jung, J. (2001). Using ICT For Strengthening Government Transparency. Retrieved
May 10, 2011, from http://www.oecd.org/dataoecd/53/55/2537402.pdf
[8] Kilama, J. (2013). Impacts Of Social Networks In Citizen Involvements To Politics . Dar es
Salaam: Mzumbe University.
[9] Mkapa, B. (2003). Improving Public Communication Of The Government Policies And
Enhancing Media Relations. Bagamoyo.
[10] Navarra, D. D. (2006). Governance Architecture Of Global ICT Programme: The Case Of
Jordan. London: London School of Economics and Political Science.
[11] United Republic of Tanzania. (1995). The Constitution of United Republic of Tanzania. Dar
Es Salaam, Tanzania: Government Printer.
[12] Van Niekerk, B., Pillay, K., & Maharaj, M. (2011). Analyzing the Role of ICTs in the
Tunisian and Egyptian Unrest. International Journal of Communication, 5, 1406-1416.
ABSTRACT
A graph is a basic data structure which can be used to model complex
structures and the relationships between them, such as XML documents, social
networks, communication networks, chemical informatics, biological networks,
and the structure of web pages. Frequent subgraph pattern mining is one of
the most important fields in graph mining. In light of its many
applications, there has been extensive research in this area, such as
analysis and processing of XML documents, document clustering and
classification, image and video indexing, graph indexing for graph querying,
routing in computer networks, web link analysis, drug design, and
carcinogenesis. Several frequent pattern mining algorithms have been
proposed in recent years, and new ones are introduced every day. Because
these algorithms use various methods on different datasets, pattern mining
types, and graph and tree representations, it is not easy to study them in
terms of features and performance. This paper presents a brief report of an
intensive investigation of current frequent subgraph and subtree mining
algorithms. The algorithms are also categorized based on different features.
Keywords
Graph Mining, Subgraph, Frequent Pattern, Graph indexing.
1. INTRODUCTION
Today we are faced with ever-increasing volumes of data, and most of these
data naturally have a graph or tree structure. The process of extracting new
and useful knowledge from graph data is known as graph mining [1][2].
Frequent subgraph pattern mining [3] is an important part of graph mining.
It is defined as the process of extracting from a database those patterns
whose frequency is greater than or equal to a user-defined threshold. Due to
its wide utilization in various fields, including social network analysis
[4][5][6], XML document clustering and classification [7][8], network
intrusion [9][10], VLSI reverse engineering [11], behavioral modeling [12],
the semantic web [13], graph indexing [14][15][16][17][18], web log analysis
[19], link analysis [20], drug design [21][22][23], and classification of
chemical compounds [24][25][26], this field has been the subject of several
works.
Support(G) = (number of graphs in the database that contain G) / (total number of graphs in the database)
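The support measure can be sketched as follows; for brevity, edge-set containment stands in for the full subgraph-isomorphism test that a real miner must perform (which is NP-complete in general), so this is an illustrative simplification:

```python
def support(pattern_edges, database):
    """Fraction of database graphs that contain the pattern.
    Each graph is a set of labeled edges; edge-set containment is an
    illustrative stand-in for subgraph-isomorphism testing."""
    hits = sum(1 for g in database if pattern_edges <= g)
    return hits / len(database)

db = [{("a", "b"), ("b", "c")}, {("a", "b")}, {("b", "c"), ("c", "d")}]
print(support({("a", "b")}, db))  # pattern occurs in 2 of the 3 graphs
```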
Furthermore, the nodes are represented on the main diagonal of the matrix
(Figure 2). To show the graph as a string, a combination of nodes and edges
as a sequence in a particular order can be used; since every permutation of
the nodes may generate a different string, the maximum or minimum canonical
adjacency matrix (CAM) must be taken into account. An advantage of this is
that two isomorphic graphs will have the same maximum/minimum CAM.
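The minimum-CAM idea can be sketched by brute force: enumerate all node permutations of a tiny graph and keep the lexicographically smallest matrix code, so that isomorphic graphs map to the same string. This is only feasible for very small graphs; practical miners use far more refined canonical-form computations:

```python
from itertools import permutations

def min_cam_string(adj):
    """Minimum canonical form of an adjacency matrix: try every node
    permutation and keep the lexicographically smallest code string."""
    n = len(adj)
    best = None
    for p in permutations(range(n)):
        code = "".join(str(adj[p[i]][p[j]]) for i in range(n) for j in range(n))
        if best is None or code < best:
            best = code
    return best

g1 = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # path 0-1-2
g2 = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]  # the same path, relabeled (1-0-2)
print(min_cam_string(g1) == min_cam_string(g2))  # True: isomorphic graphs agree
```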
Figure 2. Left: a graph; right: its corresponding adjacency matrix
4.1.2 Adjacency List
Another way to represent a graph is the adjacency list. When the graph is
sparse, the adjacency matrix contains many zeros, which is a great waste of
memory; the adjacency list avoids this by allocating memory dynamically.
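A small sketch of the conversion: only the existing edges are stored, so a sparse matrix full of zeros collapses to short neighbour lists.

```python
def matrix_to_adjacency_list(adj):
    """Convert an adjacency matrix into an adjacency list; only the
    edges that exist consume memory."""
    return {u: [v for v, bit in enumerate(row) if bit]
            for u, row in enumerate(adj)}

print(matrix_to_adjacency_list([[0, 1, 0], [1, 0, 1], [0, 1, 0]]))
# {0: [1], 1: [0, 2], 2: [1]}
```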
4.2 Subgraph Generation
Two subgraphs can be joined to generate a candidate subgraph, the result
being a new subgraph. However, given that many duplicate subgraphs might be
generated in the joining process, the way candidate subgraphs are generated
is critical. Among the available methods are extension and rightmost
expansion. In the latter, subgraphs are expanded in one direction only and
no duplicate candidates are generated.
4.3 Frequency Counting
To check whether a generated candidate is frequent, the frequency of each
must be determined and compared with the support threshold. Among the data
structures used to count the frequency of each candidate are the embedding
list and the TSP tree.
5. A SURVEY OF FREQUENT SUBGRAPH MINING
ALGORITHMS
5.1 Classification Based on Algorithmic Approach
5.1.1 Apriori-Based (Breadth First Search)
This category of algorithms uses a generate-and-test method and breadth-
first search to find subgraphs in the graphs that constitute the database.
Before the candidates of size k+1 ((k+1)-candidates) are generated, all
frequent subgraphs of size k must be found. Thus, each candidate of size k+1
is obtained by connecting two frequent subgraphs with
Figure 3. A database consisting of three graphs g1, g2, g3, two subgraphs,
and the frequency of each
5.4 Classification Based on Nature of the Output
5.4.1 Completeness of the Output
While some algorithms find all frequent patterns, others mine only part of
them. Frequent pattern mining is closely related to performance. When the
total size of the dataset is very large, it is better to use algorithms that
are faster in execution so that performance does not degrade, even though
not all frequent patterns are mined. Table 2 lists the completeness of
output [29].
Table 2. Completeness of Output

Incomplete Output: SUBDUE, GREW, CloseGraph, ISG
Complete Output:   FARMER, gSpan, FFSM, Gaston, FSG, HSIGRAM
5.4.2 Constraint-Based
As the size of the database increases, the number of frequent patterns
increases. This makes maintenance and analysis more difficult, as more
memory space is needed. Reducing the number of frequent patterns without
losing data is achievable by mining and maintaining more comprehensive
patterns. Given that each pattern satisfies the condition of being frequent,
the
Algorithm   Input           Representation     Candidate Generation      Canonical Form
Dynamic     Set of graphs   Sparse graph       Iterative Merging         Canonical Spanning Tree
GREW        Set of graphs   Adjacency Matrix   Vertex Extension          Suffix trees
AGM         Set of graphs   Search Tree        Disjunctive Normal Form   Canonical Labeling
MUSE        Set of graphs   Lattice            —                         DFS coding
MARGIN      Set of graphs   Adjacency Matrix   Join                      CAM
AcGM        Set of graphs   Adjacency Matrix   Join                      CAM
gFSG        Set of graphs   —                  Iterative Merging         Hashtree
2. Using a new method for tree representation and a lookup table that allows
quick access to node information in the candidate generation phase without
having to re-read the trees of the database.
3. Using rightmost expansion for candidate generation, which guarantees that
no duplicate candidates are generated.
This algorithm uses a lookup table, implemented as a hash table, to store
input tree information. The key part is represented as the pair (T, pos),
where T is the identifier of the input tree and pos is the node's number in
a preorder traversal; the value part is represented as (l, s), where l is
the label and s is the scope of the node. In this algorithm a new candidate
is generated using the scope of each node: a node is added to another node
along the rightmost path, within the scope of the first node, and by
continuing this process the other frequent patterns are found [64].
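One plausible reading of this lookup table can be sketched as follows, with the scope taken as the preorder number of a node's rightmost descendant (an assumption, since the excerpt does not define it precisely); the keys are (tree id, preorder position) pairs and the values are (label, scope) pairs:

```python
def build_lookup(tree_id, tree, table=None, counter=None):
    """Lookup table keyed by (tree id, preorder position), valued by
    (label, scope). A tree is a (label, [children]) pair; scope is
    taken here as the preorder number of the rightmost descendant."""
    if table is None:
        table, counter = {}, [0]
    label, children = tree
    pos = counter[0]          # this node's preorder number
    counter[0] += 1
    for child in children:
        build_lookup(tree_id, child, table, counter)
    table[(tree_id, pos)] = (label, counter[0] - 1)  # scope = last descendant
    return table

t = ("a", [("b", []), ("c", [("d", [])])])
print(build_lookup(0, t))
```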
This algorithm uses the FP-growth method to find frequent subgraphs. Its
input is a set of graphs (a transactional database). First a BitCode is
defined for each edge: when the edge is found in a graph, the corresponding
bit of its BitCode is 1, and otherwise 0. A frequency table is then sorted
in ascending order based on the BitCode belonging to each edge; afterwards,
an FP-tree is constructed and frequent subgraphs are obtained through
depth-first traversal [55].
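The BitCode construction can be sketched as follows, with each database graph reduced to a set of labeled edges (an illustrative simplification of the input representation):

```python
# Each graph is a set of edges; a BitCode records, per edge, which
# database graphs contain it (bit i = 1 iff graph i contains the edge).
graphs = [
    {("a", "b"), ("b", "c")},
    {("a", "b")},
    {("b", "c"), ("a", "c")},
]
all_edges = sorted(set().union(*graphs))
bitcodes = {e: "".join("1" if e in g else "0" for g in graphs)
            for e in all_edges}
for edge, code in bitcodes.items():
    print(edge, code)
```

Sorting the edges by the number of 1-bits in their BitCodes then yields the frequency table the algorithm builds its FP-tree from.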
6. FREQUENT SUBTREES MINING ALGORITHMS
CLASSIFICATION
6.1 Trees Representation
A tree can be encoded as a sequence of nodes and edges. Some of the most
important ways of encoding trees are introduced below:
6.1.1 DLS (Depth Label Sequence)
Let T be a labeled ordered tree, and let the depth-label pair (d(vi), l(vi)),
containing the depth and label of each node vi ∈ V, be appended to a string
s throughout a DFS traversal of T. The depth-label sequence of T is obtained
as {(d(v1), l(v1)), …, (d(vk), l(vk))}. For instance, the DLS for the tree
in Figure 4 is:
{(0,a),(1,b),(2,e),(3,a),(1,c),(2,f),(3,b),(3,d),(2,a),(1,d),(2,f),(3,c)}
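The DLS encoding can be sketched directly as a preorder traversal. The tree below is reconstructed from the sequence just given, so its exact shape is an assumption about Figure 4:

```python
def dls(tree, depth=0, seq=None):
    """Depth-label sequence: emit (depth, label) pairs in DFS preorder.
    A tree is a (label, [children]) pair."""
    if seq is None:
        seq = []
    label, children = tree
    seq.append((depth, label))
    for child in children:
        dls(child, depth + 1, seq)
    return seq

# Tree reconstructed from the DLS listed above (assumed shape of Figure 4):
t = ("a", [("b", [("e", [("a", [])])]),
           ("c", [("f", [("b", []), ("d", [])]), ("a", [])]),
           ("d", [("f", [("c", [])])])])
print(dls(t))
```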
6.1.2 DFS-LS (Depth First Sequence - Label Sequence)
Given a labeled ordered tree T, labels are appended to a string s during the
DFS traversal of T. On each backtrack a special symbol (−1, $, or /) is added
growth, it is featured with less computational work and needs less memory.
Moreover, these algorithms are specifically designed for trees and graphs
and cannot be used for other purposes. On the other hand, as they work on a
variety of datasets, it is not easy to find trade-offs between them. The
frequent patterns found can be used for similarity search, indexing, and
classifying graphs and documents in future studies. Parallel methods and
technologies such as Hadoop may also be needed when working with excessive
data volumes.
8. ACKNOWLEDGMENTS
The authors are thankful to Mohammad Reza Abbasifard for his support of this
investigation.
REFERENCES
[11] Yoshida, K. and Motoda, 1995. CLIP: Concept learning from inference
patterns, in Artificial Intelligence, pp. 63-92.
[12] Wasserman, S., Faust, K., and Iacobucci. D, 1994. Social network analysis :
Methods and applications. Cambridge university Press.
[13] Berendt, B., Hotho, A., and Stumme, G., 2002. Semantic Web Mining, in
International Semantic Web Conference (ISWC), pp. 264-278.
[14] S.C.Manekar, M.Narnaware, May 2013. Indexing Frequent Subgraphs in
Large graph Database using Parallelization, International Journal of Science
and Research (IJSR), Vol. 2 , No. 5.
[15] Peng, Tao, et al., 2010. A Graph Indexing Approach for Content-Based
Recommendation System, in IEEE Second International Conference on
Multimedia and Information Technology (MMIT), pp. 93-97.
[16] S.Sakr, E.Pardede, 2011. Graph Data Management: Techniques and
Applications, in Published in the United States of America by Information
Science Reference.
[17] Y.Xiaogang, T.Ye, P.Tao, C.Canfeng, M.Jian, 2010. Semantic-Based Graph
Index for Mobile Photo Search, in Second International Workshop on
Education Technology and Computer Science, pp. 193-197.
[18] Yildirim, Hilmi, and Mohammed Javeed Zaki., 2010. Graph indexing for
reachability queries, in 26th International Conference on Data Engineering
Workshops (ICDEW)IEEE, pp. 321-324.
[19] R.Ivancsy and I.Vajk, 2006. Frequent Pattern Mining in Web Log Data, in
Acta Polytechnica Hungarica, pp. 77-90.
[20] G.XU, Y.zhang, L.li, 2010. Web mining and Social Networking. melbourn:
Springer.
[21] S.Ranu, A.K. Singh, 2010. Indexing and mining topological patterns for drug
design, in ACM, Data Mining and Knowledge Discovery, Berlin, Germany.
[22] (2013, Dec.) Drug Information Portal. [Online]. http://druginfo.nlm.nih.gov
[23] (2013, Dec.) DrugBank. [Online]. http://www.drugbank.ca
[24] Dehaspe,Toivonen, and King, R.D., 1998. Finding frequent substructures in
chemical compounds, in In Proc. of the 4th ACM International Conference on
Knowledge Discovery and Data Mining, pp.30-36.
[25] Kramer, S., De Raedt, L., and Helma, C., 2001. Molecular feature mining in
HIV data, in Proc. of the 7th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD-01), pp. 136-143.
[26] Gonzalez, J., Holder, L.B. and Cook, 2001. Application of graph-based
concept learning to the predictive toxicology domain, in In Proc. of the
[52] Zhang, Shijie, J.Yang, and Shirong Li, 2009. Ring: An integrated method for
frequent representative subgraph mining, in Ninth IEEE International
Conference on Data Mining (ICDM), pp. 1082-1087.
[53] Fromont, Elisa, Cline Robardet, and A.Prado, 2009. Constraint-based
subspace clustering, in International conference on data mining, pp. 26-37.
[54] Ranu, Sayan, and Ambuj K. Singh., 2009. Graphsig: A scalable approach to
mining significant subgraphs in large graph databases, in IEEE 25th
International Conference on Data Engineering (ICDE), pp. 844-855.
[55] R. Vijayalakshmi,R. Nadarajan, J.F.Roddick,M. Thilaga, 2011. FP-
GraphMiner, A Fast Frequent Pattern Mining Algorithm for Network Graphs,
Journal of Graph Algorithms and Applications, Vol. 15, pp. 753-776.
[56] Zhu, Feida, et al., 2007. gPrune: a constraint pushing framework for graph
pattern mining, in Advances in Knowledge Discovery and Data Mining, , pp.
388-400, Springer Berlin Heidelberg.
[57] Yan, Xifeng, X. Zhou, and Jiawei Han, 2005. Mining closed relational graphs
with connectivity constraints, in Proceedings of the eleventh ACM SIGKDD
international conference on Knowledge discovery in data mining, pp. 324-
333.
[58] Wu, Jia, and Ling Chen, 2008. A fast frequent subgraph mining algorithm, in
The 9th International Conference for Young Computer Scientists (ICYCS), pp.
82-87.
[59] Krishna, Varun, N. N. R. R. Suri, G. Athithan, 2011. A comparative survey of
algorithms for frequent subgraph discovery, Current Science(Bangalore), pp.
1980-1988.
[60] K.Lakshmi, T. Meyyappan, Apr. 2012. A COMPARATIVE STUDY OF
FREQUENT SUBGRAPH MINING ALGORITHMS, International Journal
of Information Technology Convergence and Services (IJITCS), Vol. 2, No. 2.
[61] C.Jiang, F.Coenen, M.Zito, 2004. A Survey of Frequent Subgraph Mining
Algorithms, The Knowledge Engineering Review, pp. 1-31.
[62] M.Gholami, A.Salajegheh, Sep. 2012. A Survey on Algorithms of Mining
Frequent Subgraphs, International Journal of Engineering Inventions, Vol. 1,
No. 5, pp. 60-63.
[63] V.Singh, D.Garg, Jul. 2011. Survey of Finding Frequent Patterns in Graph
Mining: Algorithms and Techniques, International Journal of Soft Computing
and Engineering (IJSCE), Vol. 1, No. 3.
[64] Hussein, M.MA, T. H.Soliman, O.H. Karam, 2007. GP-Growth: A New
Algorithm for Mining Frequent Embedded Subtrees. 12th IEEE Symposium on
Computers and Communications.
ABSTRACT
Several IT service management (ITSM) frameworks have been deployed and are
being adopted by companies and institutes without redefining the framework
into a model which suits their IT department's operating environment and
requirements. An IT service management model is proposed for Zimbabwean
universities as a holistic approach through the integration of Operational
Level Agreements (OLAs), Service Level Agreements (SLAs) and IT Service
Catalogues (ITSCs). The OLA is considered the domain for describing IT
service management, and its attainment is driven by organisational
management and IT section personnel in alignment with the mission, vision
and values of the organisation. Explicitly defining OLAs will aid management
in the identification of key services and processes in both qualitative and
quantitative form (SLAs). Once SLAs are defined, ITSCs can be formulated: a
measure which is both customer and IT service provider centric and acts as
the nucleus of the model. Redefining IT service management from this
perspective will result in deriving value from IT service management
frameworks and in customer satisfaction.
1. INTRODUCTION
IT service management is a modern concept adopted by the IT community for
improved IT service delivery and productivity, to attain customer
satisfaction and control costs. IT service management is an integration of
IT service provisioning between service providers and end users to arrive at
end-to-end service through the implementation of measures such as Service
Level Agreements (SLAs), Operational Level Agreements (OLAs) and IT Service
Catalogues (ITSCs) (Almeroth & Hasan, 2002). Service management frameworks
such as Control Objectives for Information and related Technology (COBIT)
and the IT Infrastructure Library (ITIL) have been developed in the IT
industry, but have not been tailored to a specific IT section given its
operating environment and
constraints. IT services are the nucleus of business processes at a
university, supporting academic research, learning and teaching.
Universities offer IT services to staff, researchers, students, visitors and
partners on platforms such as e-learning, library services, staff directory
and email, and learning resources, which are crucial to learning, teaching
and collaboration as the community becomes global. The IT department must
offer better services to these stakeholders in a resource-constrained
environment (staff and financial resources) (University of Birmingham,
2014).
2. RELATED WORKS
An ITS service consists of three key elements, namely Service Level
Agreements (SLAs), Operational Level Agreements (OLAs) and Service Catalogue
pages. Operational Level Agreements (OLAs) are agreements between ITS teams,
such as the hardware, software and networking teams, on how they will
collaborate to ensure the appropriate service level is met for a particular
service under the supervision of a coordinator; an OLA defines the
expectations and commitments needed to deliver Service Level Agreements
(SLAs) (University of California, 2012). Service Level Agreements (SLAs) are
agreements between the Information Technology Services (ITS) team or teams
and their clients which define the level of service the client should
receive. An IT service catalogue is a mapping database of an institute's
available technological resources and the products/IT services on offer or
about to be rolled out (Griffiths, Lawes, & Sansbury, 2012; Moeller, 2013).
The ITS Service Catalogue is the division of the services offered at an
institute into components, with policies, guidelines and responsibilities of
the parties involved, SLAs and delivery conditions (Bon et al., 2007).
The service catalogue should be readily accessible to authorised users,
enable them to create a service request on behalf of themselves and others,
and contain facilities to approve service requests. IT service catalogues
should be tested by both IT and key users so that the product complies with
the prescribed technical functionality and usability metrics. The IT
catalogue should be developed in such a way that it facilitates effective
communication between IT management and the stakeholders involved and acts
as an effective tool for good governance (Griffiths et al., 2012; Moeller,
2013).
Basically, an IT service catalogue is divided into a business service catalogue
and a technical service catalogue. A business service catalogue is client
centric and must meet users' requirements, so the user community should
be engaged in requirement gathering and design. A technical
service catalogue, by contrast, is service provider centric and focuses on
describing specific services in IT terms, including service constructs and their
interrelationships. IT managerial and technical staff work processes are
explicitly defined, and access to the technical service catalogue is mainly
restricted to organizational staff (Troy, Rodrigo, & Bill, 2007).
An SLA should consist of the following elements: placement of
services into categories (sections of the catalogue); listing of each category as a
service catalogue section; establishment of integrated/packaged/bundled service
products; identification of modular service products; definition of each
service product; establishment of the service owner and supplier; definition of
procurement procedures (how, and at what cost); specification of service level metrics
(availability, reliability, response); definition of the limits of the service; and
definition of customers' responsibilities. It thus provides a basis for managing the
relationship between the service provider and the customer, describing the
agreement between them for the service to be
delivered, including how the service is to be measured (Hiles, 2000).
A service must provide a bridge from the developers' and engineers' point of
view to the end-users' perspective and identify the internal processes
necessary to offer and maintain the service. Service change management
and continuous process improvement are important in addressing
stakeholders' needs (University of California, 2012). A service lifecycle
focuses first on defining a service strategy and then maintaining and
implementing it; second, on service design, covering the methodology and
architectural design used to offer the service; third, on service transition, which
focuses on testing and integration of the services offered for quality and control
compliance; and finally on service operation, the smooth running of
daily IT services, with continual improvement aligning the lifecycle
stages and offering room for best practices and improvement in value
delivery (Office of Government Commerce, 2010).
A Service Level Agreement (SLA) is a blueprint which governs service
provision parameters between the service provider and the client (University
of California, 2012). An SLA mainly consists of the services being provided by the
IT service provider and how they will be delivered (they must meet user
requirements and the standards agreed upon by the parties involved and be
attainable, so communication is key throughout); definition of key
performance parameters; assignment of IT service provider personnel and users
to measure performance against specific metrics (continuously
monitoring, managing and measuring service level commitments); and identification
of the rewards or penalties levied depending on whether services are delivered effectively
or the provider fails to render them (SLA metrics should have
performance buffers to allow for recovery from breaches) (Dube &
Gulati, 2005; Lahti & Peterson, 2007).
4. METHODOLOGY
The research questions in this study examine ITS personnel's service
delivery in relation to SLAs, OLAs and ITSCs. The research approach is the
way the researcher approaches the research: either by gathering data and
formulating a theory, or by developing a theory and hypotheses and
then testing or validating them. An inductive approach was adopted since it allowed
the researchers to develop a theory during analysis of the collected data
(Saunders, Lewis, & Thornhill, 2009). The researchers used questionnaires
to carry out the research since they facilitated saturation; the questionnaires
were distributed in proportion to the personnel in each ITS department team: 20
questionnaires in the Hardware section, 7 in the Software
section and 7 in the Networking section. The response rates were 80
percent, 71.43 percent and 85.71 percent respectively. The data was coded
manually.
5. RESULTS
The hardware section team is not aware of any agreements with the
software team and the networking department to ensure the appropriate
service level is met for particular services within the ITS department. If
OLAs were put in place, personnel felt that the ITS department
director and/or other senior officers should facilitate and maintain these
agreements, since they increase efficiency and allow the alignment of work
processes with organizational objectives.
The hardware section team is also not aware of any agreements with the
software and networking teams which define the level of service students
and staff members should receive; personnel felt such agreements should be led by the chief
technician. Personnel act on intuition, working on tasks when called upon or
inferring those within their job description. All respondents agreed
that the adoption of SLAs would improve service delivery to
clients and help in setting boundaries on personnel's duties and how they
would execute them with confidence. Furthermore, it would result in process
standardization and improved accuracy in execution of tasks. 10% of the
respondents strongly agree, 60% agree, 15% are neutral and 15%
disagree that the use of SLAs will improve and differentiate services by
defining performance and its measures, which will help in building
actionable performance tracking and controls.
There is currently no policy about the IT services on offer and ready to be
delivered; respondents felt such services should be monitored by the
supervisors responsible for each specific service. In hardware
maintenance, personnel from other departments are called upon to offer all
related activities on an ad-hoc basis. ITSCs offer a platform to evaluate
whether the services being offered meet the required standard. Top
management, such as directors and supervisors, are key stakeholders in the
implementation of IT service management.
The networking section team does not have any agreements with the software
and hardware teams to ensure the appropriate service level is met for particular
services in the ITS department. The service level which students and staff
should receive, such as the uptime and download speed
available on both the wireless and wired networks, is not defined. Staff portal services and
the students' electronic learning (e-learning) accounts monitored by the
software team depend on network availability and server
capacity, which are the responsibility of the networking and hardware sections
respectively, even though there are no OLAs among the departments
concerned. Staff and students are only informally consulted on their requirements
for the services offered by the ITS department. Students and staff
members should be given a platform to request additional functionality
add-ons for their e-learning and staff portal accounts.
upgrading to a mobile site, modification of functionalities on the webpage,
and phasing out of specific services should be communicated. Figure 1
overleaf shows the developed model.
Figure 1. The developed model for implementation of IT service management.

OLAs DRIVING FORCES:
- Leadership support
- Setting specific performance benchmarks
- Rewards and recognition, or penalties, in response to adopting OLAs
- Education and awareness campaigns for ITS department section personnel
- Ensuring an effective feedback mechanism and communication
- Definition of the services required to deliver services
- Explicitly defined responsibilities of the IT service provider and recipient

OPERATIONAL LEVEL AGREEMENTS and SERVICE LEVEL AGREEMENTS (IT service provider centric):
- Identify key services and processes to achieve the required goal.
- Define services in qualitative and quantitative form.
- Monitor the key services and processes while corrective measures are taken where necessary.

SERVICE CATALOGUE (customer centric):
- Details of services and products on offer
- Reports on website availability (response time, uptime percentage, etc.)
- Support services (e.g. installation of preliminary software, mobile browser support/types of compatible mobile phones)
- Key policies
- Terms and conditions
- Service Level Agreements (SLAs)
- Key future plans (upgrading to mobile, modification of functionality, etc., or phasing out of a service)
6. CONCLUSIONS
An enabling, collaborative approach to quality improvement should be
explored by the ITS teams while involving their clients (staff and students)
so that their needs are satisfied. In achieving ITSM, goals must be
benchmarked and reviewed by a monitoring and evaluation committee
steered by the project manager. The committee must ensure the
availability of human and financial resources, for example by lobbying for
top management support and training of employees. In addition, the
committee should facilitate a cyclical communication system with
stakeholders and top management so as to ensure their support and
commitment, even during the review process. The institutional goals, vision
and mission should be aligned with the ITSM strategy adopted. A service
catalogue, which acts as a blueprint for clients in understanding and making
informed decisions about the services they use or intend to use, must
always be available to clients; it also acts as a benchmark for quality
assurance on the services the ITS department offers to clients.
OLAs between the IT service provider and the procurement or other
departments, to obtain hardware or other resources in agreed times, and
between the service desk and support groups, to provide incident resolution in
agreed times, should be defined to ensure the appropriate service level is met
(Rudd, 2010). Adoption of OLAs will result in better service delivery and
management of duties and responsibilities. Universities must integrate the
various IT teams within departments across their campuses while
explicitly defining the implementation of SLAs, OLAs and ITSCs, and must also
emphasise performance reporting facilitated by team
leaders from all IT sections. Additionally, institutes must identify the
facilitating and hindering conditions for successful ITSM, which can be
supported by conducting seminars and/or workshops on relevant IT
aspects. Conducting post-training evaluation of deliberations on ITSM will
help in continuous improvement of service delivery. Relating COBIT and
ITIL to IT service management constructs (OLAs, SLAs and ITSCs)
presents an interesting area for further research.
REFERENCES
[1] Almeroth, K.C. and Hasan, M., 2002. Management of Multimedia on the Internet: 5th
IFIP/IEEE Proceedings of the International Conference on Management of
Multimedia Networks and Services, MMNS 2002, Santa Barbara, CA, USA, October 6-
9, 2002. CA: Springer, p.356.
[2] Bon, J. van et al., 2007. IT Service Management: An Introduction. Van Haren
Publishing, p.514.
[3] Dube, D.P. and Gulati, V.P., 2005. Information System Audit and Assurance. Tata
McGraw-Hill Education, p.671.
[4] Griffiths, R., Lawes, A. and Sansbury, J., 2012. IT Service Management: A Guide for
ITIL Foundation Exam Candidates. BCS, The Chartered Institute for IT, p.200.
[5] Hiles, A., 2000. Service Level Agreements: Winning a Competitive Edge for Support &
Supply Services. Rothstein Associates Inc, p.287.
[6] Lahti, C.B. and Peterson, R., 2007. Sarbanes-Oxley IT Compliance Using Open Source
Tools. Syngress, p.466.
[7] Moeller, R.R., 2013. Executive's Guide to IT Governance: Improving Systems
Processes with Service Management, COBIT, and ITIL. John Wiley & Sons, p.416.
[8] Office of Government Commerce, 2010. Introduction to the ITIL service lifecycle. The
Stationery Office, p.247.
[9] Rudd, C., 2010. ITIL V3 Planning to Implement Service Management. The Stationery
Office, p.320.
[10] Saunders, M., Lewis, P., & Thornhill, A. (2009). Research methods for business
students. (5th Edition, Ed.) Pearson Education Limited, Essex, England.
[11] Troy, D. M., Rodrigo, F., & Bill, F. (2007). Defining IT Success Through the Service
Catalog: A Practical Guide about the Positioning, Design and Deployment of an
Actionable Catalog of IT Services (1st Edition, Ed.). US: Van Haren Publishing.
[12] University of Birmingham, 2014. IT Services - University of Birmingham. [Online]
Available at: <http://www.birmingham.ac.uk/university/professional/it/index.aspx>
[Accessed 18 Mar. 2014].
[13] University of California, 2012. ITS Service Management: Key Elements. [online]
Available at: <http://its.ucsc.edu/itsm/servicemgmt.html> [Accessed 18 Mar. 2014].
Zhou, M., Ruvinga, C., Musungwini, S. and Zhou, G. T., 2014. A Model for
Implementation of IT Service Management in Zimbabwean State
Universities. International Journal of Computer Science and Business
Informatics, Vol. 14, No. 1, pp. 58-65.
Hasan Naderi
Department of Computer Engineering
Iran University of Science and Technology
Tehran, Iran
ABSTRACT
Among all patterns occurring in a tree database, mining frequent trees is of great importance.
A frequent tree is one that occurs frequently in the tree database. Frequent subtrees are not
only important in themselves but are also applicable to other tasks, such as tree clustering,
classification, bioinformatics, etc. In this paper, after reviewing different methods of
searching for frequent subtrees, a new method based on an inverted index is proposed for
exploring frequent tree patterns. The procedure is done in two phases: passive and active.
In the passive phase, we find the subtrees in the dataset; they are then converted to strings
and stored in the inverted index. In the active phase, we easily derive the desired
frequent subtrees from the inverted index. The proposed approach tries to take advantage
of times when the CPU is idle, so that CPU utilization is at its highest in the evaluation
results. Since frequent subtree mining in the active phase is performed on the inverted index
rather than directly on the dataset, the desired frequent subtrees are found
in the fastest possible time. Another feature of the proposed method is that, unlike
previous methods, adding a tree to the dataset does not require repeating the previous
steps; in other words, the method performs well on dynamic tree sets. In
addition, the proposed method is capable of interacting with the user.
Keywords: Tree Mining, Inverted Index, Frequent Pattern Mining, Tree Patterns.
1. INTRODUCTION
Data mining or knowledge discovery deals with finding interesting patterns
or information that is hidden in large datasets. Recently, researchers have
started proposing techniques for analyzing structured and semi-structured
datasets. Such datasets can often be represented as graphs or trees. This has
led to the development of numerous graph mining and tree mining
algorithms in the literature. In this article we present an efficient algorithm
for mining trees.
Data mining has evolved from association rule mining and sequence mining to
tree mining and graph mining. Association rule mining and sequence mining
are one-dimensional structure mining, while tree mining and graph mining are
two-dimensional or higher structure mining. Applications of tree mining
arise in Web usage mining, mining semi-structured data,
bioinformatics, etc.
The basic and fundamental ideas of tree mining have been seriously discussed
since roughly the early '90s and were completed during that decade; the origin
of these ideas lies in their applications, especially on the Web. First, some
essential basic concepts are described; we then describe the proposed method
and finally evaluate the results.
2. Related Works
2.1 Pre Order Tree Traversal
There are several ways to traverse ordered trees; pre-order
traversal is one of the most important and most widely used of them. It
behaves like the Depth First Search algorithm: on a
tree T, starting from the root, we visit the root, then the left child and finally the right
child; the method is applied recursively to all nodes of the tree.
2.2 Post Order Tree Traversal
Post-order traversal is also among the most important and widely used methods of ordered
tree traversal. In this method, on a tree T we start from the
left child, then visit the right child and finally the root; the operation is
performed recursively on all nodes of the tree.
Using either traversal, we can assign a number to each node that
represents the time at which that node is visited. If we use
post-order traversal, that number is called the PON (Post Order Number).
2.3 RMP, LMP
LMP is the acronym for Left Most Path, the path from the root to the
leftmost leaf, and RMP is the acronym for Right Most Path, the path
from the root to the rightmost leaf.
2.4 Prüfer Sequence [23]
This algorithm was introduced in 1918 and is used to convert a tree to a string.
It works as follows: in a tree T, at every step the leaf node
with the smallest label is removed and the label of its parent
is appended to the Prüfer sequence. This process is repeated n-2 times, until 2
nodes remain.
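The classical construction can be sketched as follows (a hedged Python sketch; the `parent`-map tree representation is an assumption made for the example):

```python
# Classical Prüfer sequence: repeatedly remove the leaf with the smallest
# label and record the label of its neighbour, n-2 times.

def prufer_sequence(parent, n):
    """Nodes are labeled 1..n; `parent` maps each non-root node to its parent."""
    # Build an undirected adjacency structure from the parent map.
    adj = {i: set() for i in range(1, n + 1)}
    for child, par in parent.items():
        adj[child].add(par)
        adj[par].add(child)
    seq = []
    for _ in range(n - 2):
        leaf = min(v for v in adj if len(adj[v]) == 1)  # smallest-labeled leaf
        neighbour = next(iter(adj[leaf]))
        seq.append(neighbour)
        adj[neighbour].discard(leaf)
        del adj[leaf]
    return seq

# Path 1-2-3-4: leaves 1 and 2 are removed, recording their neighbours 2 and 3.
print(prufer_sequence({1: 2, 2: 3, 3: 4}, 4))  # [2, 3]
```

The resulting string has length n-2 and uniquely identifies an unlabeled-edge tree on n labeled nodes.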
2.5 Label Sequence
The next concept is the Label Sequence. This sequence is produced according to
the post-order traversal: as each node is scanned in post order, its label is
appended to the sequence.
2.6 Support
Simply put, the support of a pattern S is the proportion of trees in the
database D that contain S:

Support(S) = |{T ∈ D : S is a subtree of T}| / |D|     (1)
4. The proposed approach
The procedure is done in two phases: passive and active. In the passive
phase, we first find all subtrees of all trees and store them
in the inverted index. In the active phase, we simply use the index to extract
frequent tree patterns.
4.1 Passive Phase
This phase is done in two stages. In the first stage, we find all subtrees of
every tree in the dataset and convert each to a string identifying that tree;
in the second stage, the strings produced in the first stage are stored in
the inverted index.
4.1.1 First stage of the passive phase
The first important point is that within a tree, a node label may be
repeated many times, but every node must be identified uniquely; to
solve this problem, we use the Prüfer sequence method. Each tree is
traversed in post order, and the Prüfer sequence algorithm works on the PON,
so each node of a tree is marked with a unique number.
The next issue is that the Prüfer sequence must cover all the nodes;
therefore, instead of n-2 steps the algorithm continues for n steps, and
instead of the parent label of the last node, the number 0 is inserted. Figure 1
shows an example of this method; NPS denotes the Prüfer sequence obtained
using the post order.
The next requirement is that every subtree be displayed uniquely; to this
end, we obtain the CPS for each subtree. The CPS merges the Prüfer
sequence and the Label Sequence; in other words, CPS(T) = (NPS, LS)(T).
A CPS uniquely represents a rooted, labeled tree. As shown in
Figure 1, the tree T1 can be displayed uniquely using these two complementary
strands.
Figure 1. An example of the Prüfer Sequence and Label Sequence for the T1 tree
Next, we must ensure that every subtree of each tree is generated, and
that each subtree is created only once; for this purpose, we use the
LMP to generate subtrees. If we represent the tree T using its
Prüfer sequence and n is a subtree, then a node v to be added to
n must lie on the LMP of T; and since the PON underlies
the Prüfer sequence, v must come after the last node of n
and be attached to it in the Prüfer sequence of T. This guarantees that
each subtree is generated only once, and if this is done for all nodes,
all subtrees of each tree are produced.
Now we introduce the algorithm. The proposed algorithm for
generating subtrees and converting them to strings is shown in Figure 2.
Insert CPS(T) into array A
For i = n downto 1 do
{
    subtree = A[i]
    Insert CPS(A[i]) into TreeString_i
    Sub(subtree, i, A, stack1, stack2)
}

Sub(subtree, index, A[], stack1, stack2)
{
    c = 0
    t = 0
    For j = 1 to index - 1 do
        If index in A[j] then
        {
            stack3 = stack1
            stack4 = stack2
            subtree2 = subtree
            While stack3 not empty
            {
                t++
                Pop x from stack3
                Pop y from stack4
                subtree2 = subtree2 + x
                If t > 0 then
                {
                    Insert CPS(subtree2) into TreeString_i
                    Sub(subtree2, y, A[], stack3, stack4)
                }
            }
            If c > 0 then
            {
                Push TempTree onto stack1
                Push TempIndex onto stack2
            }
            TempTree = A[j]
            TempIndex = j
            c++
            subtree = subtree + A[j]
            Insert CPS(subtree) into TreeString_i
            Sub(subtree, j, A[], stack1, stack2)
            While stack1 not empty
            {
                c--
                Pop x from stack1
                Pop y from stack2
                Insert CPS(subtree + x) into TreeString_i
                Sub(subtree + x, y, A[], stack1, stack2)
            }
        }
}
Figure 2. The algorithm for generating subtrees and converting them to strings
In the following we examine how the algorithm works with an example. We
start from the first tree and store CPS(T) in the array A;
the array for T1 is then filled in as shown in Figure 3.
4.1.2 Second stage of the passive phase
In the second stage of this phase we use the inverted index: the strings
created in the previous stage are inserted into it. The CPS and
the number of occurrences of each subtree across all trees are stored in the
Dictionary, and the names of the trees containing the subtree are
stored in the corresponding Posting List.
Figure 4. Part of the Inverted Index made for the collection of trees T1, T2
As can be seen in the figure, the subtrees are stored in the Dictionary and the
parent trees of the corresponding subtrees are stored in the Posting List.
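The storage step can be sketched as follows (an assumed layout, not the paper's implementation): the Dictionary maps each subtree's CPS string to its total occurrence count, and the Posting List records which trees contain it.

```python
from collections import defaultdict

# Dictionary: CPS string -> occurrence count over all trees.
# Postings:   CPS string -> set of tree names containing the subtree.
index = {"dictionary": defaultdict(int), "postings": defaultdict(set)}

def index_subtree(cps_string, tree_name):
    index["dictionary"][cps_string] += 1
    index["postings"][cps_string].add(tree_name)

# Index the same subtree found in two different trees.
index_subtree("A0C3B3", "T1")
index_subtree("A0C3B3", "T2")
print(index["dictionary"]["A0C3B3"], sorted(index["postings"]["A0C3B3"]))
# 2 ['T1', 'T2']
```

Using a set for the posting list keeps each tree name stored once, even when a subtree occurs several times in the same tree.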
4.2 Active Phase
In this phase, we simply use the inverted index built in the previous phase to
extract frequent tree patterns. Various types of queries about frequent subtree
mining can be answered quickly using the index. We now examine several
types of query.
4.2.1 Finding the occurrences of a desired pattern in the tree set
First, we obtain the CPS of the desired pattern, search for it in the
Dictionary of the inverted index, and easily extract the number of
occurrences and the names of the trees containing the pattern from the
Posting List. For example, to find the occurrences of the pattern S
in the collection of trees T1, T2 in Figure 5,
we search for CPS(S), i.e. A0C3B3, in the inverted index; T1 and T2 will
be the result.
Figure 5. Part of the Inverted Index made for the collection of trees T1, T2
4.2.2 Finding frequent subtrees given a support threshold
If we want to find subtrees whose support is greater than a
threshold, we must find the subtrees whose occurrence relative to the total
number of trees exceeds the threshold. We can therefore search the inverted index and
easily find the subtrees whose Posting List length, relative to the
total number of trees, is at least equal to the support threshold.
4.2.3 Finding frequent subtrees given a support threshold and minimum node count
In this case, the number of nodes is a criterion in addition to support,
so we search the inverted index and report only the subtrees satisfying two
conditions: first, the length of the subtree stored in the Dictionary is at least
the minimum number of nodes; and second, the length of the corresponding Posting
List relative to the total number of trees is at least equal to the support threshold.
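The three query types above can be sketched over an assumed index layout (the `postings` map and `node_count` helper are hypothetical names introduced for illustration, not the paper's code):

```python
# Assumed index state: one subtree occurring in both trees, one in T1 only.
postings = {
    "A0C3B3": {"T1", "T2"},
    "A0B2":   {"T1"},
}
node_count = {"A0C3B3": 3, "A0B2": 2}
total_trees = 2

def occurs_in(cps_string):
    """4.2.1: which trees contain the pattern with this CPS string?"""
    return sorted(postings.get(cps_string, set()))

def frequent(support):
    """4.2.2: subtrees whose posting-list length / total trees >= support."""
    return [s for s, trees in postings.items()
            if len(trees) / total_trees >= support]

def frequent_min_nodes(support, min_nodes):
    """4.2.3: additionally require at least min_nodes nodes."""
    return [s for s in frequent(support) if node_count[s] >= min_nodes]

print(occurs_in("A0C3B3"))         # ['T1', 'T2']
print(frequent(1.0))               # ['A0C3B3']
print(frequent_min_nodes(0.5, 3))  # ['A0C3B3']
```

Each query is a lookup or a single scan over the Dictionary, so no traversal of the original trees is needed in the active phase.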
5. Evaluation
In this section, the proposed method is evaluated from various aspects.
We present an experimental evaluation of the proposed approach on
synthetic datasets. In the following discussion, dataset sizes are expressed in
terms of the number of trees; in the graphs, "Algorithm" denotes the proposed
method. The names and details of the synthetic datasets are shown in Table 1.
Table 1. Name and details of synthetic datasets
Name Description
DS1 -T 10 -V 100
DS2 -T 10 -V 50
As shown in Table 1, the synthetic datasets DS1 and DS2 were generated
using the PAFI [28] toolkit developed by Kuramochi and Karypis (PafiGen).
Since PafiGen can create only graphs, we extracted spanning trees from
these graphs and used them in our analysis. We also used minsup to analyze
various factors: if the number of occurrences of a subtree is less than the
minsup value, the subtree is not indexed in the inverted index. The minsup
value ranges from 1 to infinity, and its default in the proposed algorithm is 1.
In addition, we use maxnode in the evaluations. Maxnode specifies the maximum
number of nodes of each subtree in the inverted index: once the number of nodes of
a subtree reaches the maxnode value, the proposed algorithm stops generating
larger subtrees from it. The maxnode value ranges from 1 to infinity, and its
default is infinity.
5.1 Evaluating the performance of the proposed method
We first evaluated the proposed algorithm on the two synthetic
datasets DS1 and DS2. The performance of the proposed algorithm for
frequent tree mining on the synthetic datasets is shown in Diagram 1; in this
experiment, minsup is equal to one and maxnode is equal to infinity.
Given that subtrees are indexed in the passive phase, at times when the
system is idle, the mining time over the inverted index rises with a gentle slope
as the number of trees increases. This clearly shows that the introduced
algorithm is scalable.
[Diagram 1. Mining time versus number of trees (10K-50K) for DS1 and DS2.]
[Diagram: number of indexed patterns (in thousands, log scale) versus minsup (2,500 down to 1) for DS1 and DS2.]
5.3 Evaluating the effect of maxnode on memory usage
We examine the effect of the maximum number of nodes in the indexed
subtrees on memory usage in the passive phase. This experiment was done
on the synthetic datasets DS1 and DS2, generated by PAFI, with size 50K. In
this experiment, minsup was set to its default value, i.e. 1. As can be seen, the
memory usage of the algorithm increases as the number of
indexed nodes in each subtree increases.
[Diagram: virtual memory usage (MB) versus maximum number of nodes per subtree (1 to 2,500) for DS1 and DS2.]
[Diagram: CPU utilization (%) versus number of trees (10K-50K) for TreeMiner and the proposed algorithm.]
6. Conclusions and Recommendations
In this paper, a new method based on the inverted index was introduced for
frequent pattern mining, overcoming many of the disadvantages of
previous methods. One problem with existing approaches is that they mainly act
statically on the set of trees: if a new tree is added to the set, all
mining operations must be redone from scratch. This problem is
overcome by the inverted index in the proposed approach: all trees are
indexed in the passive phase, and if a new tree is added to the tree set at any
stage, only that tree is indexed, with no need to repeat the previous
operations. The algorithm therefore performs well on a collection of dynamic
trees. Another advantage of this method over others is scalability: as shown
in Section 5.1, the performance of the algorithm does not degrade as the
tree set grows. As shown in Section 5.4, one of the most striking features of the
algorithm is its efficient use of the CPU. The method also supports user
interaction.
As shown in Section 5.2, the number of indexed patterns increases
exponentially as minsup decreases, while patterns with low occurrence
generally do not matter to us; as a result, we can speed up indexing in the
passive phase by choosing an appropriate minsup value. As shown in
Section 5.3, memory usage increases with the maximum number of nodes in
the indexed subtrees, while subtrees with a very large number of nodes
usually do not matter to us; as a result, we can manage memory usage by
choosing an appropriate maxnode value.
REFERENCES
[1] B. Vo, F. Coenen, and B. Le, "A new method for mining Frequent Weighted Itemsets
based on WIT-trees," International Journal of Advanced Computer Research, p. 9,
2013.
[2] L. A. Deshpande and R. S. Prasad, "Efficient Frequent Pattern Mining Techniques of
Semi Structured data: a Survey," International Journal of Advanced Computer
Research, p. 5, 2013.
[3] A. M. Kibriya and J. Ramon, "Nearly exact mining of frequent trees in large
networks," Data Mining and Knowledge Discovery (DMKD), p. 27, 2013.
[4] G. Pyun, U. Yun, and K. H. Ryu, "Efficient frequent pattern mining based on Linear
Prefix tree," International Journal of Advanced Computer Research, p. 15, 2014.
[5] C. K.-S. Leung and S. K. Tanbeer, "PUF-Tree: A Compact Tree Structure for
Frequent Pattern Mining of Uncertain Data," Advances in Knowledge Discovery and
Data Mining, p. 13, 2013.
[6] A. Fariha, C. F. Ahmed, C. K.-S. Leung, S. M. Abdullah, and L. Cao, "Mining
Frequent Patterns from Human Interactions in Meetings Using Directed Acyclic
Graphs," Advances in Knowledge Discovery and Data Mining, p. 12, 2013.
[7] J. J. Cameron, A. Cuzzocrea, F. Jiang, and C. K. Leung, "Mining Frequent Itemsets
from Sparse Data," Web-Age Information Management, p. 7, 2013.
[8] G. Lee, U. Yun, and K. H. Ryu, "Sliding window based weighted maximal frequent
pattern mining over data streams," Advances in Knowledge Discovery and Data
Mining, p. 15, 2014.
[9] H. He, Z. Yu, B. Guo, X. Lu, and J. Tian, "Tree-Based Mining for Discovering
Patterns of Reposting Behavior in Microblog," Advanced Data Mining and
Applications, p. 13, 2013.
[10] U. Yun, G. Lee, and K. H. Ryu, "Mining maximal frequent patterns by considering
weight conditions over data streams", Advances in Knowledge Discovery and Data
Mining, 2014.
[11] B. Kimelfeld and P. G. Kolaitis, "The complexity of mining maximal frequent
subgraphs," Proceedings of the 32nd Symposium on Principles of Database Systems, p.
12, 2013.
[12] B. Vo, F. Coenen, and B. Le, "A new method for mining Frequent Weighted Itemsets
based on WIT-trees," International Journal of Advanced Computer Research, p. 9,
2013.
[13] L. A. Deshpande and R. S. Prasad, "Efficient Frequent Pattern Mining Techniques of
Semi Structured data: a Survey," International Journal of Advanced Computer
Research, p. 5, 2013.
[14] A. M. Kibriya and J. Ramon, "Nearly exact mining of frequent trees in large
networks," Data Mining and Knowledge Discovery (DMKD), p. 27, 2013.
[15] G. Pyun, U. Yun, and K. H. Ryu, "Efficient frequent pattern mining based on Linear
Prefix tree," International Journal of Advanced Computer Research, p. 15, 2014.
[16] C. K.-S. Leung and S. K. Tanbeer, "PUF-Tree: A Compact Tree Structure for
Frequent Pattern Mining of Uncertain Data," Advances in Knowledge Discovery and
Data Mining, p. 13, 2013.
[17] A. Fariha, C. F. Ahmed, C. K.-S. Leung, S. M. Abdullah, and L. Cao, "Mining
Frequent Patterns from Human Interactions in Meetings Using Directed Acyclic
Graphs," Advances in Knowledge Discovery and Data Mining, p. 12, 2013.
[18] J. J. Cameron, A. Cuzzocrea, F. Jiang, and C. K. Leung, "Mining Frequent Itemsets
from Sparse Data," Web-Age Information Management, p. 7, 2013.
[19] G. Lee, U. Yun, and K. H. Ryu, "Sliding window based weighted maximal frequent
pattern mining over data streams," International Journal of Advanced Computer
Research, p. 15, 2014.
[20] H. He, Z. Yu, B. Guo, X. Lu, and J. Tian, "Tree-Based Mining for Discovering
Patterns of Reposting Behavior in Microblog," Advanced Data Mining and
Applications, p. 13, 2013.
[21] U. Yun, G. Lee, and K. H. Ryu, "Mining maximal frequent patterns by considering
weight conditions over data streams," International Journal of Advanced Computer
Research, 2014.
[22] B. Kimelfeld, and P. G. Kolaitis, "The complexity of mining maximal frequent
subgraphs," Proceedings of the 32nd symposium on Principles of database systems,
p. 12, 2013.
[23] H. Prfer. Prfer sequence. Available:
http://en.wikipedia.org/wiki/Pr%C3%BCfer_sequence
[24] C. D. Manning, P. Raghavan, and H. Schtze, An Introduction to Information
Retrieval. Cambridge, England: Cambridge University Press, 2008.
IJCSBI.ORG
[25] Y. Xiao, J.-F. Yao, Z. Li, and M. H. Dunham, "Efficient data mining for maximal
frequent subtrees," Proceedings of 3rd IEEE International Conference on Data
Mining, p. 8, 2003.
[26] S. Tatikonda, S. Parthasarathy, and T. Kurc, "TRIPS and TIDES: New Algorithms for
Tree Mining," Proceedings of 15th ACM International Conference on Information
and Knowledge Management (CIKM), p. 12, 2006.
[27] F. D. R. Lopez, A.Laurent, P.Poncelet, and M.Teisseire, "FTMnodes: Fuzzy tree
mining based on partial inclusion," Advanced Data Mining and Applications, pp.
22242240, 2009.
[28] Kuramochi and Karypis. Available: http://glaros.dtc.umn.edu/gkhome/pafi/overview/
[29] M. J. Zaki, "Efficiently Mining Frequent Trees in a Forest," Proceedings of the 8th
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
(SIGKDD), Edmonton, Canada, p. 10, 2002.
IJCSBI.ORG
ABSTRACT
The main objective of this work is to develop a practical approach to improving customer satisfaction, which is generally regarded as the pillar of customer loyalty to the company. Today, customer satisfaction is a major challenge: listening to the customer and anticipating and properly managing his claims are cornerstones and fundamental values for the enterprise. In terms of product quality, skills and, above all, the service provided to the customer, it is essential for organizations to differentiate themselves, especially in an ever more competitive world, in order to ensure a higher level of customer satisfaction. Ignoring or disregarding customer satisfaction can have harmful consequences for both economic performance and the organization's image. It is therefore crucial to develop new methods and approaches to the problem of customer dissatisfaction by improving the quality of the services provided to the customer. This work describes a simple and practical approach to modeling customer satisfaction in organizations in order to reduce the level of dissatisfaction; the approach respects the constraints of the organization and eliminates any action that could lead to the loss of customers or the degradation of the organization's image. Finally, the approach presented in this document is tested and evaluated.
Keywords: Approach, Evaluation, Quality, Satisfaction, Test of homogeneity, Validation.
1. INTRODUCTION
"Does the company have the most meaningful information at the right time to make the best possible business decisions?" is the question most companies want to answer. "The purpose of a company is to create and keep a customer" (Levitt, 1960): this statement clearly identifies the important phases of the customer-management life cycle, namely acquiring customers and ensuring their loyalty. Companies are moving towards customer-oriented management and focus on the life cycle of their customers. According to Moisand (2002), the customer life cycle is defined as the time interval during which a customer's status changes from new customer to lost or former customer.
In the context of a globalized and very competitive market, where departments have moved from a more classical, cost-centered level of management to a value-centered approach, the mission of decision-makers has evolved from proposing services and strategic partnerships to value creation. To achieve this goal it is necessary to have all the data needed to enlighten the past and clarify the present in order to predict the future, while avoiding gray areas caused by a lack of information. Business intelligence includes all the IT solutions (methods, facilities and tools) used to pilot the company and support decision-making.
The information system (IS) is thus a representation of reality; it helps coordinate the activities of the company.
2. PROPOSED APPROACH
The Standish Group (Valery, 2001) conducted an international study evaluating the success and failure of IT projects. The data, accumulated over the past ten years, are based on a sample of 50,000 projects. This study identified three levels of evaluation of a project:
The success of a project: characterized by a system delivered on time, at a cost within budget, and fully compliant with the specifications;
The failure of a project: characterized by the cessation of the project;
Finally, the partial success or partial failure of a project: characterized by the late delivery of a system that is only partially responsive, especially in terms of business scope and specifications, at a cost of up to 200% of the original budget.
2.1 Statement
This study shows that customer satisfaction is not always reached; making perceived quality converge towards the desired quality presents a real challenge. Within the company, quality is increasingly focused on customer satisfaction. To win contracts, business leaders rely more on quality than on price advantages. Staff involvement, together with listening to the customer, is a key element in the success of a quality approach. The latter is the implementation of all the resources available to an establishment to provide a service that meets the needs and expectations of customers. From the customer's perspective, a warm welcome and quality service are "normal"; it is the lack of quality that penalizes him.
a) To develop team skills: provide additional training on IT tools to build up the teams' skills.
g) To reach out to customers on the Internet: the benefit may also be provided on the Internet by another customer or a social network (Twitter, Facebook, ...). Make social media a true extension of customer service, with employees able to participate in discussions and respond directly to customer requests on these media.
3. EVALUATION AND VALIDATION OF THE APPROACH
Consider the case of a service company that manages the work of a major customer such as "France Gas". The latter signed a contract with the host company specifying the clauses that must be respected, among them a customer satisfaction rate that should reach 92%; this percentage was established by agreement between the two parties, and if it is not met, a penalty is applied for customer dissatisfaction. A development team of the host company handles the realization of applications for "France Gas". This team should deliver 22 applications monthly, with a dissatisfaction rate that must not exceed 8% (2 applications per month). Customer dissatisfaction is due to the following causes:
The application does not answer the need or generates unexpected errors after delivery;
Timeout.
The first is the null hypothesis H0: "Qp = Qr", where
Qr is the desired proportion of customer satisfaction and
Qp is the real percentage of satisfaction.
The second is the alternative hypothesis H1: "Qp < Qr".
3.1.1 Example 1: April 2013
The team was able to process only 10 simple applications. The customer sent feedback indicating his degree of satisfaction. There are three kinds of response: S (Satisfied), NS (Not Satisfied) and N (Neutral).
Figure: distribution of the feedback — S (satisfied): 50%, NS (unsatisfied): 40%, N (neutral): 10%.
Ps(t0) = P(Xt0 = S) = 0.5
PNS(t0) = P(Xt0 = NS) = 0.4
PN(t0) = P(Xt0 = N) = 0.1
As Qr = 92%, with the hypotheses H0: "Qp = Qr" and H1: "Qp < Qr", we use a one-tailed (left) test on the statistic

z = (Qp − Qr) / √(Qr(1 − Qr) / n)

If z > −1.645, we accept the hypothesis H0 and reject H1 with an error risk α = 5%.
Here Qp = 0.5, Qr = 0.92 and n = 10, so z ≈ −4.9 < −1.645. We therefore accept the hypothesis H1: "Qp < Qr" and reject H0: "Qp = Qr" with an error risk α = 5%, and the observed difference is significant.
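The one-tailed test above can be sketched in Python; the function name and the explicit critical value −1.645 (standard normal, 5% left tail) are illustrative assumptions, not taken from the paper:

```python
import math

def left_tailed_proportion_test(qp, qr, n, z_crit=-1.645):
    """Left-tailed z-test: H0: Qp = Qr versus H1: Qp < Qr.

    Returns the statistic z = (Qp - Qr) / sqrt(Qr(1 - Qr)/n) and
    True when H0 is accepted (z > z_crit) at the 5% error risk.
    """
    z = (qp - qr) / math.sqrt(qr * (1 - qr) / n)
    return z, z > z_crit

# April 2013 example: 10 applications processed, observed satisfaction
# Qp = 0.5 against the contractual target Qr = 0.92.
z, h0_accepted = left_tailed_proportion_test(0.5, 0.92, 10)
```

With these figures z falls far below −1.645, so H1 is retained, matching the conclusion drawn in the text.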
Table 3. Customer feedback of December 2013

No.  Application            Satisfaction (S, NS, N)   Reason of dissatisfaction
1    MSC_CASP69             NS                        timeout
2    MSC_MDX                NS                        timeout
3    Woodmac                S
4    Whoswho                S
5    Adobe Air Installer    S
6    WinZip                 S
7    MSC_SetupDemdet        S
8    Jabber                 S
9    TrendMicro_Office      S
10   ORG+                   S
11   QlikView               S
12   Q4-Engica              N
13   TMS                    N
14   MSCLink_Core           S
15   MIPS                   S
16   Rsclientprint          NS                        application does not work correctly
17   TextPad                S
18   MSC_DMX                S
19   MSC_MSCOMCT2           NS                        timeout
20   Add-in Excel           S
21   Pre-req Excel          S
22   Ios                    S
Figure: distribution of the feedback — S (satisfied): 16 (72.73%), NS (unsatisfied): 4 (18.18%), N (neutral): 2 (9.09%).
Ps(t0) = P(Xt0 = S) = 0.727
PNS(t0) = P(Xt0 = NS) = 0.181
PN(t0) = P(Xt0 = N) = 0.091
So we accept the hypothesis H0: "Qr = Qp" and reject H1: "Qp < Qr" with an error risk α = 5%. The difference between P and P0 observed is due to sampling fluctuations.
No.  Application                Satisfaction (S, NS, N)   Reason of dissatisfaction
3    See Electrical Viewer 4    S
4    Adobe_Flash_Player         S
5    MSC_DEPOT                  S
6    Colibri 2.0                S
7    Navision                   S
8    OFFICE 2013                S
9    Windows6.1-KB2592687       S
10   CheckPoint VPN             S
11   Interlink_MSCLink          S
12   CrystalReportsRuntime      N
13   InterlinkComponentOne      S
14   MSXML                      S
15   VisualC++Redistributable   S
16   ReportViewer_2010          NS                        application does not work correctly
17   .Net_Framework             S
18   MSCLink_Core               S
19   MSCLink_Configuration      NS                        timeout
20   LDOC                       S
21   MigrationAssistantTool     S
Ps(t0) = P(Xt0 = S) = 0.81
PNS(t0) = P(Xt0 = NS) = 0.14
PN(t0) = P(Xt0 = N) = 0.05
So we accept the hypothesis H0: "Qr = Qp" and reject H1: "Qp < Qr" with an error risk α = 5%. The difference between P and P0 observed is due to sampling fluctuations.
4. TEST OF HOMOGENEITY
We are faced with two samples for which it is most often not known whether they come from the same source population, and we seek to test whether these samples share the same characteristic. Two values f1 and f2 are observed; the difference between them may be due either to sampling fluctuations or to a difference between the characteristics of the two original populations. That is to say, from the examination of two samples of sizes n1 and n2, extracted respectively from populations P1(m1; σ1) and P2(m2; σ2), these tests are used to decide between:
H0: θ1 = θ2 (we conclude homogeneity);
H1: θ1 ≠ θ2 (we conclude heterogeneity).
In our case we test the homogeneity of two proportions:
f1 = proportion of units having the character X in sample 1;
f2 = proportion of units having the character X in sample 2;
p1 = proportion of units having the character X in population 1;
p2 = proportion of units having the character X in population 2.
H0: p1 = p2 = p and H1: p1 ≠ p2.

p is replaced by the estimator f = (n1·f1 + n2·f2) / (n1 + n2) = (22 × 0.72 + 21 × 0.81) / (22 + 21) ≈ 0.764

x = (0.81 − 0.72) / √(0.764 × (1 − 0.764) × (1/22 + 1/21)) ≈ 0.69 > −1.645

so H0 is accepted and we conclude that the two proportions are homogeneous.
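The homogeneity test can be sketched as follows; the function name is an illustrative assumption, while the pooled estimator and the statistic follow the formulas of this section:

```python
import math

def two_proportion_homogeneity(f1, n1, f2, n2):
    """Pooled two-proportion statistic:
    x = (f2 - f1) / sqrt(f(1 - f)(1/n1 + 1/n2)),
    where f = (n1*f1 + n2*f2) / (n1 + n2) is the pooled estimator."""
    f = (n1 * f1 + n2 * f2) / (n1 + n2)
    return (f2 - f1) / math.sqrt(f * (1 - f) * (1 / n1 + 1 / n2))

# December sample (f1 = 0.72, n1 = 22) against the later month
# (f2 = 0.81, n2 = 21).
x = two_proportion_homogeneity(0.72, 22, 0.81, 21)
```

Recomputing gives x ≈ 0.69, which stays above the critical value −1.645, so homogeneity is accepted.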
5. CONCLUSIONS
The work done here develops a practical and pragmatic approach to maximize customer satisfaction in an organization over a given period. The proposed approach, together with its evaluation and validation, is described above. In our view, this work opens the way to diverse research perspectives situated on two levels: deepening the realized research and extending its domain. In terms of deepening the proposed work, it would be interesting first to use Markov chains to model the proposed approach statistically, and to propose or develop practical tools for its implementation. As for extending the domain of the research, it would be interesting to connect this approach to the governance of information systems and to a decision-support system that investigates the options and compares them in order to choose an action that helps in making the decision.
REFERENCES
[1] BUFFA, Elwood. Operations Management, 3rd ed., NY, John Wiley & Sons, 1972.
[2] FITZSIMMONS, James A. and Mona J. FITZSIMMONS. Service Management: Operations, Strategy and Information Technology, 3rd ed., NY, Irwin/McGraw-Hill, 2001.
[3] Z. Adhiri, S. Arezki, A. Namir. What is Application Lifecycle Management?, International Journal of Research and Reviews in Applicable Mathematics and Computer Science, ISSN 2249-8931, December 2011.
[4] http://hal.archives-ouvertes.fr/docs/00/71/95/35/PDF/2010CLF10335.pdf
[5] STEVENSON, William J. Introduction to Management Science, 2nd ed., Burr Ridge, IL, Richard D. Irwin, 1992.
[6] HILLIER, Frederick S., Mark S. HILLIER and Gerald J. LIEBERMAN. Introduction to Management Science: A Modeling and Case Studies Approach with Spreadsheets, New York, Irwin/McGraw-Hill, 2000.
[7] A. EL KEBBAJ and A. NAMIR. Modeling customer's satisfaction. Day of Science Engineers, Faculty of Science Ben M'Sik, Casablanca, July 29, 2013.
[8] http://www.projectsmart.co.uk/docs/chaos-report.pdf
[9] http://info.informatique.entreprise.over-blog.com/article-approche-du-systeme-d-information-dans-l-entreprise-69885381.html
[10] http://www.hamadiche.com/Cours/Stat/Cours5.pdf
[11] S. AREZKI. ITGovA: a proposed new approach to the governance of information systems. PhD in Computer Science, defended at the Faculty of Sciences of Ben M'Sik, Casablanca, 24/02/2013.
ABSTRACT
Social networking sites have become popular in recent years; among these, Twitter is one of the fastest growing. It plays the dual role of an Online Social Network (OSN) and a micro-blogging service. Spammers invade Twitter trending topics (popular topics discussed by Twitter users) to pollute the useful content. Social spamming is more successful than email spamming because it exploits the social relationships between users. Spam detection is important because Twitter is widely used for commercial advertisement, and spammers invade the privacy of users and damage their reputation. Spammers can be detected using content-based and user-based attributes, with traditional classifiers applied for the detection. This paper focuses on the study of detecting spam in Twitter.
1. INTRODUCTION
Web-based social networking services connect people to share interests and activities across political, economic, and geographic borders. Online social networking sites like Twitter, Facebook, and MySpace have become popular in recent years. They allow users to meet new people, stay in touch with friends, and discuss everything including jokes, politics, news, etc. Using social networking sites, marketers can directly reach customers; this benefits not only the marketers but also the users, who get more information about the organization and the product. Twitter [1] is one of these social networking sites. Twitter provides a micro-blogging service (the exchange of small elements of content such as short sentences, individual images, or video links) where users can post messages called tweets. A tweet is limited to 140 characters; only text and HTTP links are allowed. A Twitter user is identified by a user name and, optionally, a real name. When user A starts following other users, their tweets appear on A's page, and A can be followed back if the other user desires. Trending topics in Twitter can be identified by hash tags (#). When a user likes a tweet, he/she can retweet that message. Tweets are visible publicly by default, but senders can deliver messages only to their
followers. The @ sign followed by a username marks a reply to another user. The most common type of spamming in Twitter is through tweets; sometimes it is via posting suspicious links.
Spam [14] can arrive in the form of direct tweets to your Twitter inbox. Unfortunately, spammers use Twitter as a tool to post malicious links and send spam messages to legitimate users. They also spread viruses or simply compromise the system's reputation. Twitter is widely used for commercial advertisement, and spammers invade the privacy of users and damage their reputation. Attackers advertise on Twitter, selling products by offering huge discounts and free products. When users try to purchase these products, they are asked to provide account information, which is retrieved by the attackers and misused. Therefore spam detection in any social networking site is important.
2. RELATED WORKS
McCord et al. [1] have proposed user-based and content-based features to facilitate spam detection.
Secondly, they compare four traditional classifiers, namely Random Forest, Support Vector Machine (SVM), Naïve Bayesian and K-nearest neighbor, which are used to detect spammers. Among these classifiers Random Forest is found to be effective, but it has to deal with an imbalanced data set (a data set with more regular users than spammers). Alex Hai Wang [2] considered the 'follower-friend' relationship in his paper, modeling a directed social graph. The author considers content-based and graph-based features to facilitate spam detection.
In the above figure, user A is following user B, while user B and user C are following each other, i.e., users B and C are mutual friends and users A and C are strangers. The graph-based features considered are the number of followers, the number of friends, and the reputation of a user.
The classifier used in this paper to detect spam is the Naïve Bayesian classifier [10]. It is based on Bayes' theorem, which is given by the equation

P(Y|X) = P(X|Y) P(Y) / P(X) (2.2)

The Twitter account is considered as a feature vector X, and each account is assigned one of two classes Y, spam or non-spam; the assumption is that the features are conditionally independent. This classifier is easy to implement and requires only a small amount of training data, but the conditional-independence assumption may lead to a loss of accuracy, since the classifier cannot model dependencies between features.
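As a sketch of how such a classifier scores an account, the snippet below applies Bayes' theorem with conditionally independent binary features; the feature names and probability values are purely illustrative assumptions, not estimates from the paper:

```python
# Illustrative (made-up) priors and per-class feature likelihoods.
priors = {"spam": 0.2, "non-spam": 0.8}
# P(feature is present | class), assumed conditionally independent.
likelihood = {
    "spam":     {"has_url": 0.9, "high_tweet_rate": 0.7},
    "non-spam": {"has_url": 0.3, "high_tweet_rate": 0.1},
}

def posterior(features):
    """Bayes' theorem: P(Y|X) is proportional to P(Y) * product of P(x_i|Y);
    normalizing over the two classes plays the role of dividing by P(X)."""
    scores = {}
    for cls, prior in priors.items():
        p = prior
        for feat, present in features.items():
            p_feat = likelihood[cls][feat]
            p *= p_feat if present else (1 - p_feat)
        scores[cls] = p
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}

post = posterior({"has_url": True, "high_tweet_rate": True})
```

With these illustrative numbers, an account showing both features is classified as spam, since its posterior dominates.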
Twitter Account Features:
Zi Chu et al. [13] review some classification features used to detect spammers, including tweet-level and account-level features. The tweet-level features include the Spam Content Proposition, i.e., the tweet text is checked against a spam word list and the final landing URL is checked. The account-level features include the account profile, i.e., the short self-description text and homepage URL, which are checked for spam words.
Fabricio Benevenuto et al. [3] have considered the problem of detecting spammers. In their paper approximately 96% of legitimate users and 70% of spammers were correctly classified. As in [1], user-based and content-based attributes are considered. To measure the accuracy of spammer detection, a confusion matrix is introduced.
Evaluation Metrics:
Precision: the ratio of the number of users correctly classified as spammers to the total number of users predicted as spammers, given by the equation

P = TP / (TP + FP) (2.3)

Recall: the ratio of the number of users correctly classified as spammers to the total number of actual spammers, given by the equation

R = TP / (TP + FN) (2.4)

F-measure: the harmonic mean of precision and recall, given by the equation

F = 2PR / (P + R) (2.5)

where TP, FP and FN denote true positives, false positives and false negatives, respectively.
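These metrics are straightforward to compute from confusion-matrix counts; the counts below are hypothetical, chosen only to illustrate equations (2.3)-(2.5):

```python
def precision(tp, fp):
    """Equation (2.3): TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Equation (2.4): TP / (TP + FN)."""
    return tp / (tp + fn)

def f_measure(p, r):
    """Equation (2.5): harmonic mean 2PR / (P + R)."""
    return 2 * p * r / (p + r)

# Hypothetical counts: 70 spammers caught, 4 false alarms, 30 missed.
p = precision(70, 4)
r = recall(70, 30)
f = f_measure(p, r)
```

Note that the F-measure always lies between precision and recall, closer to the smaller of the two.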
The classifier used to detect spam is the SVM, a state-of-the-art classification method; here a non-linear SVM with the Radial Basis Function (RBF) kernel is used, which allows the SVM to learn complex decision boundaries. The biggest limitations of the support vector approach lie in the choice of the kernel and its high algorithmic complexity. This approach mainly focuses on detecting spam instead of spammers, so that it can be useful in filtering spam: once a spammer is detected it is easy to suspend the account and block the IP address, but spammers continue their work from new accounts.
Puneeta Sharma and Sampat Biswas [4] proposed two key components: (1) identifying the timestamp gap between two successive tweets and (2) identifying tweet content similarity. They found two common techniques used by spammers: (1) posting duplicate content with small modifications to the tweet; (2) posting spam within short intervals. Their spam identification approach includes BOT activity detection and a tweet similarity index. Twitter data can be filtered in various ways, e.g. by user id or by keyword. Many spammers post spam messages using a BOT (a computer program), reducing the interval between consecutive tweets. To calculate the timestamp gap between tweets, they first cluster tweets by user id and sort them by increasing timestamp.
Figure: BOT activity detection — tweets are clustered by user id; if the gap between consecutive tweets is less than 10 seconds, the tweets are flagged as spam, otherwise as non-spam.
Figure: tweet similarity detection — tweets are clustered by user id and buckets of similar tweets are created; if a bucket contains more than one tweet, its tweets are flagged as spam, otherwise as non-spam.
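The BOT-detection flow (cluster tweets by user id, sort by timestamp, flag short gaps) can be sketched as below; the sample records are hypothetical and the 10-second threshold is the one shown in the figure:

```python
from collections import defaultdict

def flag_bot_users(tweets, min_gap=10):
    """Cluster tweets by user id, sort each cluster by timestamp, and
    flag users whose consecutive tweets are less than min_gap seconds apart."""
    by_user = defaultdict(list)
    for user, ts in tweets:
        by_user[user].append(ts)
    flagged = set()
    for user, stamps in by_user.items():
        stamps.sort()
        if any(b - a < min_gap for a, b in zip(stamps, stamps[1:])):
            flagged.add(user)
    return flagged

# Hypothetical (user_id, unix_timestamp) records.
tweets = [("bot42", 100), ("bot42", 103), ("bot42", 107),
          ("alice", 100), ("alice", 500)]
flagged = flag_bot_users(tweets)
```

Here "bot42" tweets 3-4 seconds apart and gets flagged, while "alice" does not.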
Levenshtein distance
It is a string metric for measuring the difference between two sequences of text. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, and substitutions) required to change one word into the other. The phrase "edit distance" is often used to refer to the Levenshtein distance. The distance is zero if the strings are equal. For example, the Levenshtein distance between "sitter" and "sitting" is 3:
sitter → sittir (substitution of "i" for "e")
sittir → sittin (substitution of "n" for "r")
sittin → sitting (insertion of "g" at the end)
The Levenshtein distance is used to find duplicate tweets, i.e., if the tweets are duplicates then the distance is zero.
Jaccard Index
It is also called the Jaccard similarity coefficient. It is used for comparing the diversity and similarity of sample sets:

J(A, B) = |A ∩ B| / |A ∪ B| (2.6)

The Jaccard distance measures the dissimilarity between sample sets and is obtained by subtracting the Jaccard coefficient from 1:

d(A, B) = 1 − J(A, B) (2.7)
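Both measures are easy to implement directly; a minimal sketch follows (plain dynamic programming for the edit distance, word sets for the Jaccard index), with example strings that are illustrative only:

```python
def levenshtein(a, b):
    """Wagner-Fischer dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| over word sets."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

d = levenshtein("sitter", "sitting")   # 3, as in the text
j = jaccard("free offer click here", "free offer click now")
```

The corresponding Jaccard distance is 1 − j, per equation (2.7).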
Dolvara Gunatilaka [9] discusses two privacy issues. The first is the user's identity, or user anonymity; the second is user profile or personal information leakage.
User anonymity
In many social networking sites users use their real name to represent their account. There are two methods to expose a user's anonymity: (1) the de-anonymization attack and (2) the neighborhood attack [15]. In the first, the user's anonymity can be revealed by history stealing and group membership information, while in the second the attacker finds the neighbors of the victim node. Regarding user profiles and personal information, attackers are attracted by personal details such as name, date of birth, contact information, relationship status, current work and education background.
Information can also leak because of poor privacy settings: many profiles are made public, i.e., anyone can view them. Next is leakage of information through third-party applications. Social networking sites provide an Application Programming Interface (API) that lets third-party developers create applications; once users access these applications, the third party can access their information automatically.
Social Worms
This section discusses some social worms. Among them, the Twitter worm is one of the most popular.
Twitter worm: a term describing worms that spread through Twitter. There are many versions; two worms discussed in this paper are:
Profile Spy worm: this worm spreads by posting a link that downloads a third-party application called Profile Spy (a fake application). When users try to download the application they must fill in some personal information, which allows the attacker to obtain the user's information. Once an account is infected, it will continuously tweet malicious messages to its followers. The second Twitter worm is the Google worm, which uses a shortened Google URL to trick users into clicking the link. The fake link redirects users to a fake anti-virus website, which displays a warning that the computer is infected and lets the user download the fake antivirus, which is actually malicious code.
Restrictions in Twitter
Some of the restrictions considered in Twitter [9] are:
a. Following a large number of users in a short time.
b. Unfollowing and following someone repeatedly.
c. A small number of followers compared to the number of accounts followed.
d. Duplicate tweets or updates.
e. Updates consisting of only links.
The distance between two users is calculated as follows [5][6]: when two users are directly connected by an edge, the distance is one, which means that the two users are friends; when the distance is greater than one, they have common friends but are not friends themselves. Connectivity represents the strength of the relationship, and one way to measure it is to count the number of paths between the users; hence, the connectivity between a spammer and a legitimate user is weaker. The problem with this system is that it identifies messages as normal if they come from infected friends, and attackers may send spam messages from legitimate accounts by stealing passwords.
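One simple reading of these ideas: distance as shortest-path length via breadth-first search, and connectivity as the number of common friends (length-2 paths). The friendship graph below is a hypothetical example, not data from the cited work:

```python
from collections import deque

# Hypothetical undirected friendship graph as adjacency sets.
graph = {
    "A": {"B"},
    "B": {"A", "C", "D"},
    "C": {"B", "D"},
    "D": {"B", "C"},
}

def distance(graph, src, dst):
    """Shortest-path length between two users (None if disconnected)."""
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, d = frontier.popleft()
        if node == dst:
            return d
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return None

def common_friends(graph, u, v):
    """Number of length-2 paths between u and v: a crude connectivity measure."""
    return len(graph[u] & graph[v])

d_ab = distance(graph, "A", "B")   # directly connected: friends
d_ac = distance(graph, "A", "C")   # connected through B
```

Here A and B are friends (distance 1), while A and C share only the common friend B (distance 2).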
Content-based spam filter [10]: this filter works on the words and phrases of the email content, i.e., it associates each word with a numeric value; if the accumulated value crosses a certain threshold, the message is considered spam. It can detect only valid words with correct spellings. This filter uses Bayes' theorem for detecting spam content.
Word stemming or word hashing technique: this filter [12] extracts the stem of a modified word so that the efficiency of detecting spam content can be improved. A rule-based word-stemming algorithm is used for spam detection. Stemming converts a word into a related base form; one such transformation is the conversion of plurals into singulars by removing suffixes.
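A minimal rule-based suffix stripper in the spirit of this filter; the rule list is a simplified illustration, not the algorithm of [12], and naive rules can over-strip (e.g. "winning" becomes "winn"):

```python
# Simplified suffix rules, applied in order (first match wins).
SUFFIX_RULES = [("sses", "ss"), ("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + replacement
    return word

stems = [stem(w) for w in ["offers", "offering", "winning", "prizes"]]
```

Mapping "offers" and "offering" to the same stem "offer" is what lets a spam-word list match modified forms of a word.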
3. CONCLUSIONS
Spammers are a major problem in any online social networking site. Once a spammer is detected it is easy to suspend his/her account or block the IP address, but spammers try to spread spam from other accounts or IP addresses. Hence it is recommended to check tweets for spam content on the server: if any content matches the spam words present in the data set, it is prevented from being displayed. Accuracy is evaluated in classifying the spam content. Many traditional classifiers exist for separating spammers from legitimate users, but many of them wrongly classify non-spammers as spammers. Hence it is more effective to check for spam content in tweets.
REFERENCES
[1] M. McCord, M. Chuah, Spam Detection on Twitter Using Traditional Classifiers.
Lecture Notes in Computer Science, Volume 6906, pp. 175-186, September 2011.
[2] A. H. Wang, "Don't Follow Me: Spam Detection in Twitter". Proceedings of the 5th International Conference on Security and Cryptography (SECRYPT), July 2010.
[3] Fabricio Benevenuto, Gabriel Magno, Tiago Rodrigues, and Virgilio Almeida,
Detecting Spammers on Twitter. CEAS 2010 Seventh annual Collaboration,
Electronic messaging, Anti Abuse and Spam Conference, July 2010.
[4] Puneeta Sharma and Sampat Biswas,Identifying Spam in Twitter Trending Topics.
American Association for Artificial Intelligence, 2011.
[5] Levenshtein distance, http://en.wikipedia.org/wiki/Levenshtein_distance.
[6] Jaccard index, http://en.wikipedia.org/wiki/Jaccard_index.
[7] Jonghyuk Song, Sangho Lee and Jong Kim, Spam Filtering in Twitter using Sender-
Receiver Relationship. Recent Advances in Intrusion Detection, Lecture Notes in
Computer Science, Volume 6961, pp 301-317, 2011
[8] D. Karthika Renuka, T. Hamsapriya, "Email Classification for Spam Detection using Word Stemming". International Journal of Computer Applications 1(5), pp. 45-47, February 2010.
[9] Twitter Spam Rules, http://support.twitter.com/articles/64986-reporting-spam-on-twitter.
[10] S. L. Ting, W. H. Ip, Albert H. C. Tsang, "Is Naïve Bayes a Good Classifier for Document Classification?". International Journal of Software Engineering and Its Applications, Vol. 5, No. 3, July 2011.
[11] R. Malarvizhi, K. Saraswathi. "Content - Based Spam Filtering and Detection
Algorithms - An Efficient Analysis & Comparison". International Journal of
Engineering Trends and Technology (IJETT), Volume 4, Issue 9, Sep 2013.
[12] N.S. Kumar, D. P. Rana, R. G. Mehta, Detecting E-mail Spam Using Spam Word
Associations, International Journal of Emerging Technology and Advanced
Engineering , Volume 2, Issue 4, April 2012.
[13] Zi Chu, Indra Widjaja, Haining Wang, Detecting Social Spam Campaigns on
Twitter. Lecture Notes in Computer Science, Volume 7341, pp. 455-472, 2012.
[14] Chris Grier, Kurt Thomas, Vern Paxson, Michael Zhang, @spam: The Underground
on 140 Characters or Less. Proceedings of the 17th ACM conference on Computer and
Communications Security, ACM New York, NY, USA, 2010.
[15] Bin Zhou and Jian Pei, "Preserving Privacy in Social Networks Against Neighborhood Attacks". Proceedings of the IEEE 24th International Conference on Data Engineering, April 2008.