International Journal of Computer Science & Information Security
Please consider contributing to, and/or forwarding to the appropriate groups, the following opportunity to submit and publish
original scientific results.
The topics suggested by this issue can be discussed in terms of concepts, surveys, state of the art, research,
standards, implementations, running experiments, applications, and industrial case studies. Authors are invited
to submit complete unpublished papers, which are not under review in any other conference or journal, in the
following (but not limited to) topic areas.
See the authors' guide for manuscript preparation and submission guidelines.
Indexed by Google Scholar, DBLP, CiteSeerX, Directory of Open Access Journals (DOAJ), Bielefeld
Academic Search Engine (BASE), SCIRUS, Scopus Database, Cornell University Library, ScientificCommons,
ProQuest, EBSCO and more.
Deadline: see web site
Notification: see web site
Revision: see web site
Publication: see web site
Dr. Riktesh Srivastava
Assistant Professor, Information Systems, Skyline University College, University
City of Sharjah, Sharjah, PO 1797, UAE
Dr. T. C. Manjunath
HKBK College of Engg., Bangalore, India
Prof. Ning Xu
Wuhan University of Technology, China
Dr. Bilal Alatas
Department of Software Engineering, Firat University, Turkey
Dr. Venu Kuthadi
University of Johannesburg, Johannesburg, RSA
The technical co-sponsorship provided by international agencies and institutions (Google Scholar,
CiteSeerX, Cornell University Library, Ei Compendex, Scopus, DBLP, DOAJ, ProQuest and EBSCO) is
gratefully appreciated. The excellent contributions of the authors made this volume possible, and in
particular we want to acknowledge the contributions of the following researchers: S. Antoinette Aroul
Jeyanthi, Dr. S. Pannirselvam, P. Rajeswari, Dr. T. N. Ravi, S. Prasath, A. Vinayagam, Dr. C. Kavitha, Dr.
K. Thangadurai, Dr. P. Raajan, K. Aruna Devi, Lydia Packiam Mettilda, D. Rajakumari, Yogalakshmi. S,
Venkatesh Kumar. R, Lawrance. R, S. Ponmani, R. Sridevi, R. Rahinipriyadharshini, Dr. S. K. Jayanthi, V.
Subhashini, D. Jemimah Sweetlin, J. Jebakumari Beulah Vasanthi, R. Sankarasubramanian, Dr. S.
Sukumaran, Dr. G. Anandharaj, D. Roselin Selvarani, J. Daniel Mano, Dr. S. Sathappan, Prof. T.
Ranganayaki, Prof. Dr. M. Venkatachalam, M. Anandhi, Dr. Antony Selvadoss Thanamani, K. Priyanka,
R. Keerthana, M. Suguna, Dr. K. Meenakshi Sundaram, C. Veerakumar, M. Menaka, R. N. Muhammad
Ilyas, M. Parameswari, R. Pushpalatha, R. Rajamani, M. Rathika, K. Ramya, G. Vijaiprabhu. Particular
thanks go to S. Prasath for his professionalism and support as reviewer for this special issue.
The conference organizers and the IJCSIS editorial board are confident that you will find the papers included
in this volume interesting and useful.
We support researchers to succeed by providing high visibility and impact value, prestige, and an efficient
publication process and service.
The International Journal of Computer Science and Information Security (IJCSIS) editorial board would like to
thank the committee members of the National Conference on Research Issues in Image Analysis &
Mining Intelligence for the help they provided in the realization of these proceedings and in paper selection.
Particular thanks go to Dr. R. Shanmugasundaram, Dr. K. Meenakshisundaram, Dr. S. Sukumaran, Prof.
C. Senthilkumar, Prof. M. Parameswari, Prof. T. Ranganayaki and Prof. R. Sankarasubramanian. A final
thanks goes to S. Prasath, IJCSIS Reviewer, for his diligent work in compiling and reviewing the selected
papers.
Table of Contents
Preface
1 http://sites.google.com/site/ijcsis/
ISSN 1947-5500
(IJCSIS) International Journal of Computer Science and Information Security
National Conference on Research Issues in Image Analysis & Mining Intelligence (NCRIIAMI-2015)
To address this problem, He et al. proposed another strategy in a support-confidence-cost framework to
discover action rules directly from a database [10]. Ras et al. proposed an approach to generate
association-type action rules [11]. Subsequently, Ras and Dardzinska proposed a bottom-up strategy to
discover action rules without using pre-existing classification rules [12].

3. INTERESTINGNESS MEASURES
In recent years, a lot of work has been done on defining and quantifying interestingness. As a result,
several measures that view interestingness from different perspectives have been proposed, developed and
applied. Interestingness measures attempt to capture the amount of interest any pattern is expected to
evoke on inspection. Merriam-Webster's collegiate dictionary defines interest as a feeling that accompanies
or causes special attention to an object or class of objects, or something that arouses such attention.
Interesting patterns are supposed to arouse strong attention from users. A user might find a pattern
interesting for various reasons, some of which may be difficult to articulate. It has been found that
interestingness is an elusive concept [13] consisting of many facets that are difficult to operationalize and
therefore difficult to capture. In some cases, a particular behavior in a domain might be interesting, while
the same behavior exhibited in another domain may not be. Thus, interestingness may be domain- and
user-dependent. In other cases the same features may be domain- and user-independent. Capturing all
facets of interestingness in one single measure simultaneously is an arduous if not impossible task.

An important and useful classification scheme for interestingness measures is based on user involvement,
which yields two categories: objective and subjective measures [14][15]. Objective measures usually deal
with data-related aspects such as the data distribution, the structure of the rule and so on, while subjective
measures are more user-driven. Objective measures have their own limitations and biases. They can be
used as initial filters to remove definitely uninteresting or unprofitable rules: rules that are statistically
insignificant may be removed since they do not warrant further attention.

3.2 Subjective Measures of Interestingness
Domain experts (users) play an important role in the interpretation and subsequent application of data
mining results; hence the need to incorporate user views in addition to data-related aspects. Generally,
users differ in their beliefs and interests since they may possess varied experience, knowledge and
psychological make-up. They may be interested in different aspects of the domain depending on their area
of work. In addition, they may also have varying goals and differences of opinion about the applicability
and usefulness of KDD results. This variation in interest underlines the importance of injecting subjectivity
into interestingness evaluation. Actionability and unexpectedness are two facets that determine subjective
interestingness [14]. Another feature of interestingness pertains to the prior knowledge of the user: prior
knowledge increases the interestingness of a subject.

Interestingness measures can play an important role in the identification of novel, relevant, implicit and
understandable patterns from the multitude of mined patterns. They help automate most of the
post-processing work in data mining. Applying interestingness measures during the various phases of data
mining has its own advantages and disadvantages. If the dataset is large, it may be advantageous to mine
rules first and then apply interestingness measures; for a small dataset, it may be preferable to apply
interestingness measures during the mining phase itself.

4. PROPOSED TECHNIQUE
In this paper, we propose and implement a dynamic rule filtering technique to filter out uninteresting
patterns based on the user's beliefs. In this technique, objective measures are used as a first filter to
remove rules that are definitely uninteresting. Subjective measures are then used as a second filter that
brings in user beliefs.

Fig 1: Proposed Framework

First, basic association mining is applied over the data, and from the discovered rules the definitely
uninteresting ones are filtered out by setting objective measures such as a support threshold and a
confidence threshold. On the other hand, not all
rules with high confidence and support are interesting. Rules can fail to be interesting for several reasons:
- A rule can correspond to prior knowledge or expectation.
- A rule can refer to uninteresting attributes or attribute combinations.
- Rules can be redundant.

Hence, we consider user beliefs as a second filter. From the first-filtered rules we remove the rules that are
irrelevant according to the user's beliefs. The user beliefs can be implemented as a rules set, and rules can
be added to or removed from this set dynamically. The discovered rules are compared with the rules set,
which is used to predict and prune whether a particular pattern is interesting or not.

4.1 Dynamic Rule Filtering Technique
The proposed Dynamic Rule Filtering Technique is implemented with the following algorithm.

INPUT:  MDB - Master Database
        RS  - Rules Set consisting of uninteresting rules
        (S) - Symptoms
OUTPUT: Filtered Rules Set, FRS

STEP 1: Activate the Master Database, MDB, and search for the Symptoms {s1, s2, ..}
STEP 2: Apply a basic mining algorithm to find all frequent patterns which satisfy the minimum support
and minimum confidence thresholds in DB (the subset of MDB after extraction); do the following steps:
(a) Scan the database to calculate the support of each item set.
(b) Add the item set to the frequent item sets if its support is greater than or equal to min_support.
(c) At each level, divide the frequent item set into a left-hand side and a right-hand side.
(d) Calculate the confidence of each rule that is generated.
(e) Generate strong rules satisfying min_support and min_confidence.
STEP 3: For each rule Rj in DB, do the following steps:
(a) Compare Rj with all the rules in RSi, i = 1, 2, .., n
(b) If Rj matches, then remove Rj from DB; else include Rj in FRS
STEP 4: Output the set of filtered rules, FRS

5. EXPERIMENTAL EVALUATION
A data set of 1000 clinical records having different attributes such as symptoms, age, sex, disease, tests
and treatments was entered in the back-end database. When a symptom is searched, the proposed
approach searches the back-end database, finds all the frequent diseases, tests and treatments associated
with the searched symptom, and filters out those patterns which do not satisfy the minimum support and
minimum confidence thresholds. The set of discovered rules is large and contains some irrelevant rules.

The age and sex of a patient play an important role in the medical practitioner's decision-making process.
According to the knowledge and beliefs of medical experts and doctors, certain medical tests are not
applicable to certain age groups and sexes. We entered those rules in a SQL database, RS. Rules can be
added to or removed from the database dynamically. The discovered patterns are compared with the rules
in RS; if a pattern matches, it is considered irrelevant and filtered out.

A portion of the rules set is shown in Table 1.

Table 1. Portion of Rules Set
ID   Test                                      Age Group / Sex
1    Polio Tests                               Above 50
2    Pelvic area                               Male
3    Testosterone replacement therapy (TRT)    Female
4    Testosterone replacement therapy (TRT)    0-10
5    ..                                        ..
6    ..                                        ..

Fig. 2 shows a screenshot of the dynamic rule filtering technique.

Fig 2: Screenshot of Rule Filtering Technique.

The main parameters we considered for the analysis of different association rule mining approaches are
scalability and user interesting criteria. Table 2 shows the comparison of different association rule mining
approaches.
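As a minimal sketch (not the authors' SQL-backed implementation), the two filters of the proposed technique can be expressed as follows. The rule representation (an antecedent/consequent pair of attribute sets), the toy transactions and the thresholds are illustrative assumptions modeled loosely on Table 1:

```python
# First filter: objective measures (min_support, min_confidence).
# Second filter (STEP 3): drop mined rules matching the uninteresting set RS.

def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def objective_filter(rules, transactions, min_sup, min_conf):
    """Keep only rules meeting both the support and confidence thresholds."""
    kept = []
    for lhs, rhs in rules:
        sup = support(lhs | rhs, transactions)
        conf = sup / support(lhs, transactions)
        if sup >= min_sup and conf >= min_conf:
            kept.append((lhs, rhs))
    return kept

def subjective_filter(rules, rs):
    """STEP 3: drop any rule whose consequent matches an entry of RS."""
    return [r for r in rules if frozenset(r[1]) not in rs]

# Toy clinical transactions: each record is a set of attribute values.
transactions = [
    {"fever", "flu test", "Above 50"},
    {"fever", "flu test", "0-10"},
    {"fever", "polio test", "Above 50"},
    {"cough", "polio test", "Above 50"},
]

candidate_rules = [
    ({"fever"}, {"flu test"}),                # sup 0.50, conf 0.67 -> kept
    ({"fever"}, {"polio test", "Above 50"}),  # sup 0.25 -> dropped by filter 1
    ({"Above 50"}, {"polio test"}),           # sup 0.50, conf 0.67 -> kept
]

first = objective_filter(candidate_rules, transactions, min_sup=0.5, min_conf=0.6)
# RS mirrors Table 1: "polio test" is marked uninteresting by the experts.
RS = {frozenset({"polio test"})}
frs = subjective_filter(first, RS)
print(len(first), len(frs))  # 2 rules survive filter 1, 1 survives filter 2
```

The second filter here matches on the rule consequent only; the paper's RS additionally ties each entry to an age group or sex, which a fuller implementation would compare as well.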
Table 2. Comparison of various rule mining algorithms
Parameter                   CLOSET Algorithm   GART Algorithm   Proposed Algorithm
Scalability                 Yes                Yes              Yes
User Interesting Criteria   No                 No               Yes

Using simulation, we were able to obtain efficient results in terms of the number of rules filtered as
compared to the classical approach. Table 3 shows the comparison of the number of interesting rules
filtered when the second filter is applied and when it is not. Each row corresponds to a different symptom
being given, filtering a different set of uninteresting rules. Our experimental evaluation shows that the
rules filtered out are uninteresting to the user.

Table 3. Number of Rules Filtered
No.   No. of Rules   Filtered (Classical: Second Filter Not Applied)   Filtered (Proposed: Second Filter Applied)
1     978            850                                               535
2     780            600                                               255
3     855            436                                               198
4     926            742                                               511

6. CONCLUSION
This paper discusses the concept of interestingness measures and their usage. It also proposes and
implements a new dynamic rule filtering technique which uses user beliefs as a second filter to remove
irrelevant rules. In the proposed technique, we identify interesting rules indirectly by eliminating
uninteresting rules. The effectiveness of this approach lies in its ability to quickly eliminate large families
of rules that are not interesting.

7. REFERENCES
[1] G. Piatetsky-Shapiro, "Discovery, analysis, and presentation of strong rules," AAAI Press, pp. 229-248, Menlo Park, CA, 1991.
[2] Hoschka and Klösgen, "A support system for interpreting statistical data," AAAI Press, pp. 325-345, Menlo Park, CA, 1991.
[3] B. Liu, W. Hsu, and S. Chen, "Using general impressions to analyze discovered classification rules," in Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (KDD '97), pp. 31-36, 1997.
[4] Q. Yang, J. Yin, C. X. Ling, and T. Chen, "Postprocessing decision trees to extract actionable knowledge," in Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM '03), pp. 685-688, November 2003.
[5] P. Schrodt, "Forecasting conflict in the Balkans using hidden Markov models," in Programming for Peace, R. Trappl, Ed., pp. 161-184, Springer, 2006.
[6] Z. Ras and A. Tzacheva, "In search for action rules of the lowest cost," in Monitoring, Security, and Rescue Techniques in Multiagent Systems, B. Dunin-Keplicz, A. Jankowski, A. Skowron, and M. Szczuka, Eds., pp. 261-272, Springer, 2005.
[7] Z. W. Ras, L.-S. Tsay, A. A. Tzacheva, and O. Gurdal, "Mining for interesting action rules," in Proceedings of the IEEE/WIC/ACM International Conference on Intelligent Agent Technology (IAT '05), pp. 187-193, September 2005.
[8] K. Wang, S. Zhou, and Y. He, "Growing decision trees on support-less association rules," in Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '00), pp. 265-269, August 2000.
[9] A. A. Tzacheva and L.-S. Tsay, "Tree-based construction of low-cost action rules," Fundamenta Informaticae, vol. 86, no. 1-2, pp. 213-225, 2008.
[10] Z. He, X. Xu, S. Deng, and R. Ma, "Mining action rules from scratch," Expert Systems with Applications, vol. 29, no. 3, pp. 691-699, 2005.
[11] Z. W. Ras, L.-S. Tsay, A. Dardzinska, and H. Wasyluk, "Association action rules," in Proceedings of the IEEE International Conference on Data Mining Workshops, pp. 283-290, December 2008.
[12] Z. W. Ras and A. Dardzinska, "Action rules discovery without pre-existing classification rules," in Rough Sets and Current Trends in Computing, vol. 5306 of Lecture Notes in Computer Science, pp. 181-190, Springer, 2008.
[13] U. Fayyad and R. Uthurusamy, "Evolving data mining into solutions for insights," Communications of the ACM, vol. 45, no. 8, pp. 28-31, 2002.
[14] A. Silberschatz and A. Tuzhilin, "What makes patterns interesting in knowledge discovery systems," IEEE Transactions on Knowledge and Data Engineering, pp. 970-974, 1996.
[15] A. A. Freitas, "On objective measures of rule surprisingness," in Proceedings of the Second European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD-98), pp. 1-9, 1998.
[16] R. Meo, "Theory of dependence values," ACM Transactions on Database Systems, vol. 25, pp. 380-406, 2000.
[17] B. Shekar and R. Natarajan, "A transaction-based neighbourhood-driven approach to quantifying interestingness of association rules," in Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM), pp. 194-201, 2004.
routes may either eventually be dropped or be delayed, depending on the network transport protocol.

5. Congested nodes or links: Certain nodes or links may become over-utilized due to the dynamic topology
of the network and the nature of the routing protocol, leading to either larger delays or packet loss.

The remainder of the paper is organized as follows: Section 2 outlines QoS, Section 3 describes QoS
routing protocols, Section 4 reviews the literature, and Section 5 provides the conclusion.

2. QUALITY OF SERVICE
2.1 QoS Definition
[Mas] QoS for a network is measured in terms of the guaranteed amount of data that is transferred from
one place to another during a certain time. QoS is identified as a set of measurable pre-specified service
requirements such as delay, bandwidth, probability of packet loss, and delay variance (jitter). [San et al 12]
The goal of QoS is to achieve more deterministic network behavior, so that the information carried by the
network can be better delivered and network resources can be better utilized.

2.2 Why is QoS needed and difficult?
MANETs have special features like autonomous architecture, distributed operation, multi-hop routing,
reconfigurable topology, fluctuating link capacity, and lightweight terminals. Certain issues such as
security, routing, reliability, internetworking, and power consumption are to be considered while designing
a MANET, because of the shared nature of the wireless medium, node mobility, and battery limitations.

Due to the rapid expansion of multimedia technology and real-time applications, QoS is needed. To
support QoS, link state information such as delay, bandwidth, cost, loss rate and error rate in the network
should be available and manageable. However, obtaining and managing link state information in a
MANET is very difficult because the quality of a wireless link is subject to change with the surrounding
circumstances. Due to the bandwidth constraints and dynamic topology of MANETs, supporting QoS for
the delivery of real-time communications is a challenging task.

2.3 Challenges
Providing good QoS in a MANET is challenging due to the following: [Laj et al 07]

Node Mobility: In a MANET, the topology is highly dynamic due to the mobility of nodes. Since the
topology is short-lived, routing overhead is high, as frequent updates must be carried out to allow data
packets to be routed to their destinations. Also, node mobility increases packet loss, which in turn affects
end-to-end delay.

Lack of Central Control: MANETs can be set up at any place and time without any pre-existing
infrastructure and provide access to information and services regardless of geographic position. This makes
it difficult to have any centralized control. Hence the controlling activities have to be distributed among the
nodes, which requires a lot of information, thus increasing the routing overhead.

Unreliable Wireless Channel: Due to interference from other transmissions, thermal noise etc., the wireless
channel is prone to bit errors. This makes it impossible to provide hard packet delivery ratio or link
longevity guarantees.

Channel Contention: In order to discover network topology, nodes in a MANET must communicate on a
common channel, which introduces the problems of interference and channel contention.

Limited Device Resources: Even though mobile devices are becoming increasingly powerful, such devices
generally have less computational power, less memory, and a limited (battery) power supply compared to
devices such as desktop computers typically employed in wired networks. This factor has a major impact
on the provision of QoS assurances.

2.4 QoS Constraints
[Aur 02] has identified the following QoS constraint types and also stated the various parameters that can
be used in each category, as listed below:

Table 2: QoS constraints
Constraints type                                       Parameter used
Time constraints / Additive constraints                Delay, jitter and cost
Space constraints                                      System buffer
Frequency constraints / Concave / Convex constraints   Network/system bandwidth
Multiplicative constraints                             Loss probability
Reliability constraints                                Error rate

2.5 QoS Parameters
[Ama et al 13] has pointed out the various QoS parameters that are suitable for different areas:

Table 3: QoS parameters
Areas                Parameters used
Multimedia           Jitter, delay, bandwidth
Military             Security
Networks (Routing)   Responsiveness, throughput, capacity, propagation delay, round trip time,
                     delay variation (jitter), maximum transmission unit, bandwidth delay product

2.6 Single-constrained vs Multi-constrained QoS Metrics
In earlier days, most routing protocols concentrated on providing only the throughput QoS parameter,
which succeeded in many respects but does not always perform best. In CEDAR, bandwidth was the only
QoS parameter used in routing. But certain areas like multimedia applications, which require multiple QoS
parameters such as jitter, delay, cost etc., forced the trend to move from single-constrained routing to
multi-constrained routing. The main function of multi-constrained routing is to find a feasible path that
satisfies multiple constraints simultaneously, which is a big challenge in a MANET due to the dynamic
topology. Examples are QMRPD (QoS Multicast Routing Protocol for Dynamic group topology), GAMAN
(Genetic Algorithm based routing for MANETs) and HMCOP (Heuristic Multi-constrained Optimal Path).
[Sam et al 12]
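The constraint types of Table 2 imply different composition rules when per-link metrics are combined into a path metric: additive metrics (delay, jitter, cost) sum over the links, concave metrics (bandwidth) are limited by the bottleneck link, and loss probability composes through the product of per-link success probabilities. A minimal sketch, with illustrative link values not taken from any of the surveyed protocols:

```python
import math

def path_delay(links):
    """Additive constraint: total path delay is the sum over links."""
    return sum(l["delay_ms"] for l in links)

def path_bandwidth(links):
    """Concave constraint: the bottleneck link limits the path bandwidth."""
    return min(l["bw_mbps"] for l in links)

def path_loss(links):
    """Multiplicative constraint: per-link success probabilities multiply."""
    success = math.prod(1.0 - l["loss"] for l in links)
    return 1.0 - success

def feasible(links, max_delay_ms, min_bw_mbps, max_loss):
    """Multi-constrained check: the path must satisfy all constraints at once."""
    return (path_delay(links) <= max_delay_ms
            and path_bandwidth(links) >= min_bw_mbps
            and path_loss(links) <= max_loss)

path = [
    {"delay_ms": 10, "bw_mbps": 8.0, "loss": 0.01},
    {"delay_ms": 15, "bw_mbps": 5.0, "loss": 0.02},
]
print(path_delay(path), path_bandwidth(path), round(path_loss(path), 4))
print(feasible(path, max_delay_ms=40, min_bw_mbps=4.0, max_loss=0.05))
```

A multi-constrained routing protocol must find a path for which `feasible` holds for every required constraint simultaneously, which is what makes the problem hard under dynamic topology.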
2.7 Hard QoS vs Soft QoS Approach
If the QoS requirements of a connection are guaranteed to be met for the entire session, the QoS approach
is termed a hard QoS approach. In MANETs, it is very difficult to provide hard QoS due to the limited
bandwidth, energy constraints and dynamic topology. If the QoS requirements are not guaranteed for the
whole session, it is termed a soft QoS approach, and most protocols provide only soft QoS.

3. QoS ROUTING
QoS routing is a routing process that guarantees to support a set of QoS parameters when establishing a
route. QoS routing in MANETs needs to support multimedia real-time communication like
video-on-demand, news-on-demand, web browsing, traveler information systems etc. These applications
require a QoS guarantee not only over a single hop, but over the entire wireless multi-hop path. [See 12]

3.1 Goals of QoS Routing
The goal of QoS routing is twofold: a) selecting network paths that have sufficient resources to meet the
QoS requirements and b) achieving global efficiency in resource utilization. [Shi et al 99]

2. Maintenance problem: When the network topology changes, new routes, if available, have to be chosen
or found quickly, in such a manner that existing QoS requirements are still supported.

3. Variable resource problem: Whenever there is a change in the route or in link characteristics, this
problem addresses how to react to the changes in available resources within the route.

3.4 Layered Architecture of QoS
According to [See et al 12], the layered view/architecture of quality of service contains 3 parts (User,
Application, Network), as given in Figure 2.

Figure 2: Layered view of QoS (User, Application, Network)
Layer                         Metrics
Link and MAC                  Delay, link reliability, link stability, relative mobility, normalized MAC load,
                              MAC energy efficiency
Physical layer                Signal-to-interference ratio, bit error rate, node residual battery charge
Transport/Application layer   Session acceptance/blocking ratio, session completion/dropping ratio

3.7 Network Resources Needed to Provide QoS
A network resource is required to perform a task and is consumed during its performance, as correctly
pointed out by [Laj et al 07]. The following is a list of network resources:

Node computing time: While mobile devices are being manufactured with increasingly powerful
processors, they are still limited in computing power. As communication protocols usually do not place a
heavy burden on the processor, it is the least critical resource.

Node battery charge: This is the most critical resource, because if a node's battery is drained, it cannot
function at all. Node failures can also cause network partitioning. Hence power-aware and energy-efficient
MAC and routing protocols have received a great deal of research attention.

Node buffer space (memory): During a network's operation, more than one node may be transmitting at
the same time, or there may be no known route to another device. In either of these cases data packets
must be buffered. Furthermore, when the buffers are full, any newly arriving packets must be dropped,
thus contributing to packet loss.

Channel capacity: Measured in bps, it affects data throughput and, indirectly, delay and hence a host of
other metrics too.

4 REVIEW OF LITERATURE
4.1 Multimedia based QoS
[Win et al 14] developed an Enhanced AODV for providing QoS for multimedia applications in MANETs.
It aims at obtaining the most relevant path and also computes multiple disjoint alternate paths from source
to destination. When the current active path breaks, the path with the second highest priority is used for
packet transmission. The application content is classified into video, audio and data. Hop count and
end-to-end delay are used as optimization parameters and bandwidth is used as a constraint parameter.

[Gee et al 12] proposed an on-demand QoS routing scheme, the QoS-Based Dynamic Source Routing
(QDSR) protocol. This scheme separates the data service into two groups: i) non-real-time data service,
where bandwidth is not so sensitive and delay is allowed during transmission, and ii) real-time data
service, where delay and bandwidth are very sensitive. It focuses on finding a route that satisfies the QoS
requirements and has a better chance of surviving over a period of time, even after node movement.

Table 5: Comparative table for Multimedia based QoS

Algorithm: Enhanced AODV for providing QoS of Multimedia application in MANET
  Parameters considered: Hop count, end-to-end delay, bandwidth
  Simulation result parameters discussed: Packet delivery ratio, routing overhead for audio, video and data
  Outcome: Reduced processing overhead and complexity

Algorithm: A QoS-Based DSR for Supporting Multimedia Services in Mobile Ad Hoc Networks
  Parameters considered: Bandwidth, link delay, signal strength
  Simulation result parameters discussed: Average throughput, packet delivery ratio, lower packet loss rate
  and lower routing overhead
  Outcome: Found the most feasible route; selected the most stable link, thus reducing route maintenance

4.2 Routing based QoS
[Ash et al 14] developed power-hop based Ad-Hoc On-Demand Distance Vector (PH-AODV), which uses
node power and hop count parameters to select the best routing path. This method checks the source route
table to find a valid route if one exists, or otherwise starts a route discovery process. It aims to achieve
better throughput, better average end-to-end delay and better average dropped packets.

[Ila et al 13] proposed Quality of Service (QoS) Routing in Mobile Ad-hoc Network (MANET) using
AODV protocol: Cross-Layer Approach, which considers several constraints for route selection instead of
considering only hop count, to attain reliable data transmission.

[Gov et al 13] focused on the Improvement of QoS assurance in MANET by QSIG-AODV: a QoS based
signaling and on demand routing algorithm. The network layer makes use of in-band signaling and the
AODV routing protocol to find the best path that satisfies the QoS requirements. The concept is divided
into two parts: i) QoS signaling, responsible for fast resource reservation and restoration, and ii) QoS
routing, to provide optimal routing based on the AODV protocol.

Table 6: Comparative table for Routing based QoS

Algorithm: PH-AODV routing algorithm
  Concept: Considers node power and hop count
  Outcomes: Average throughput, average end-to-end delay and average dropped packets

Algorithm: Quality of Service (QoS) Routing in Mobile Ad-hoc Network (MANET) using AODV protocol:
Cross-Layer Approach
  Concept: Uses a cross-layer approach to improve information sharing between the network and physical layers
  Outcomes: Packet delivery ratio, delay and packet drop
Algorithm: Improvement of QoS assurance in MANET by QSIG-AODV: a QoS based signaling and on
demand routing algorithm
  Concept: Makes use of a QoS in-band signaling mechanism and the AODV protocol
  Outcomes: Increased packet delivery ratio, negligible congestion and end-to-end delay

7. [Kui et al 12] Kui Wu and Janelle Harms, QoS Support in Mobile Ad Hoc Networks, University of
Alberta, 2012.
8. [Laj et al 07] Lajos Hanzo and Rahim Tafazolli, University of Surrey, A Survey of QoS Routing
Solutions for Mobile Ad Hoc Networks, IEEE Communications Surveys & Tutorials, Vol. 9, No. 2, 2nd
Quarter 2007.
of average anthropometric measures. Corners, the salient features of the eyes, are detected and used to set the initial parameters of the eye templates. The corner positions locate the templates in relation to the eye images accurately and greatly reduce the processing time for the templates.
Yuille et al. [9] presented a model for the detection of face features using deformable templates. The face features are described by a parameterized template, and an energy function is defined which connects edges, peaks and valleys in the image intensity to the corresponding template properties. The template interacts dynamically with the input image by manipulating its parameter values to minimize the energy function.
Yunhui Liu et al. [10] presented a face normalization method based on the location of the eyes. The face is detected using a boosted cascade of simple Haar features. This method detects the position of the face and the distance between the eyes together with the orientation properties.
S.T. Gandhe et al. [11] developed a contour-matching-based face recognition system. The shape of the contours is affected by tilting of the face, though this effect is not examined; setting the threshold values automatically according to the characteristics of the image remains an important problem, and the algorithm needs to be tested on a database with larger variation. The face similarity indicator is found to perform satisfactorily under adverse conditions of exposure, illumination, contrast variation and face pose.
S.T. Gandhe et al. [12] implemented different methods of face recognition, namely Principal Component Analysis and cascaded Discrete Wavelet Transform. The feasibility of these algorithms for human face identification is demonstrated through experimental investigation. In contour matching, the recognition rate and the recognition time per image are very high. The energy function of the Hopfield network must be rescaled to avoid spurious states and to improve the recognition rate, and the algorithm needs to be tested for large variations of pose.
The literature relevant to the proposed research has been analyzed above. Even though various models and techniques have been developed, many issues in face recognition remain unresolved. From this analysis, various drawbacks of the existing systems are identified: parameters such as computational cost, processing time and process complexity make the existing systems complex. To overcome these issues, it is necessary to propose a hybrid model for face recognition and retrieval with multiscale features, organized in a simplified way, to achieve better results and accuracy.
3. METHODOLOGY
To overcome the limitations of the existing methodology, a method for face recognition and retrieval with multiscale features is proposed.
3.1 Local Binary Pattern (LBP)
Texture is a term that characterizes the contextual property of an image [2]. A texture descriptor can characterize an image as a whole, or it can characterize the image locally at the micro level and, by global texture description, at the macro level. The LBP method labels every pixel in the image by thresholding the eight neighbors of the pixel against the center pixel value: if a neighbor's value is less than the threshold, it is assigned 0, otherwise 1.
The original LBP operator has been used extensively for texture discrimination, demonstrating excellent results and good robustness against rotation. The operator labels the pixels of an image by thresholding the 3×3 neighborhood of each pixel with the center value and reading the results as a binary number. The histogram of the labels can be used as a texture descriptor [1]. The limitation of the basic LBP operator is its small neighborhood, which cannot capture dominant features with large-scale structure.
3.2 Gray Level Co-occurrence Matrix
One of the simplest approaches to describing texture is to use statistical moments of the intensity histogram of an image. A statistical approach such as the co-occurrence matrix provides valuable information about the relative positions of neighbouring pixels in an image. The Gray Level Co-occurrence Matrix, developed by Haralick, is a tabulation of how often different combinations of pixel brightness values occur in an image.
3.3 Proposed Hybrid Model
The proposed hybrid model with multiscale feature extraction is the combination of the autocorrelation function, the Fiducial Point Feature (FPF) and the octagon model. Each technique in the proposed model serves a different purpose: the FPF-based technique segments the face image for feature generation, the autocorrelation function extracts the texture feature, and the octagon model extracts the triangle- and rectangle-based face image features.
[Figure: Input Image -> FFPF generation with fiducial point features (eye, eyebrow, nose, mouth); FTex generation using texture feature with autocorrelation function; FOCT feature set (rectangle, triangle); FHFE = {FTex, FFPF, FOCT}; Image Matching; Image Retrieval]
Fig.1.1 Process Flow of Proposed Hybrid Model
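The 3×3 LBP labelling of Section 3.1 and the co-occurrence counting of Section 3.2 can be sketched as follows. This is an illustrative sketch, not the authors' implementation; the toy image and helper names are invented.

```python
# Illustrative sketch: 3x3 LBP labelling and a basic gray-level
# co-occurrence matrix on a tiny grayscale image (list of rows).

def lbp_code(img, r, c):
    """LBP label of pixel (r, c): threshold the 8 neighbours against the
    centre value; neighbours >= centre contribute a 1-bit, else 0."""
    center = img[r][c]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]  # clockwise from top-left
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code

def lbp_histogram(img):
    """Histogram of LBP labels over interior pixels; usable as the
    texture descriptor described in Section 3.1."""
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
    return hist

def glcm(img, levels, dr=0, dc=1):
    """Co-occurrence counts of gray-level pairs at displacement (dr, dc):
    a tabulation of how often each pair of pixel values occurs."""
    m = [[0] * levels for _ in range(levels)]
    for r in range(len(img)):
        for c in range(len(img[0])):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < len(img) and 0 <= c2 < len(img[0]):
                m[img[r][c]][img[r2][c2]] += 1
    return m

image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 2, 2, 2],
         [2, 2, 3, 3]]
hist = lbp_histogram(image)
cooc = glcm(image, levels=4)
```

The histogram of `hist` would serve as the LBP texture descriptor, and statistics of `cooc` (contrast, correlation, etc.) as the GLCM descriptor.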
The above Fig.1.1 shows the process flow of the proposed face recognition and retrieval process with the generated multiscale features of the face. The feature set generated by the proposed model makes recognition and retrieval efficient. To overcome the issues in the existing system, the proposed hybrid feature extraction model includes the maximum number of features and provides better results.
The proposed model produces three sets of features, presented below:
- Estimation of octagon-based features.
- Extraction of the texture feature using the autocorrelation function.
- Extraction of fiducial point features.
These features are extracted during the generation of the feature set using the techniques most suitable for each, and are presented in the following sections.
3.3.1 Octagon Feature
The octagon is a polygon: a closed plane shape enclosed by eight straight lines, with eight vertices and eight edges. When extracting edges from the input image, the closed edges are considered for the extraction of edge orientation.
Fig. 1.2 Octagon with edges
FOCE = {FOCE1, FOCE2, FOCE3, FOCE4, FOCE5, FOCE6, FOCE7, FOCE8}   (3.1)
// F - Face, OCE - Octagon Edges //
First divide the regular octagon into parts to find the area:
Area = {OCE1, OCE2, OCE3, OCE4, OCE5, OCE6, OCE7, OCE8}
Identify the area of the regular octagon and compare it with the triangle and rectangle methods. To find the area of the octagon, sum all the areas of the interior shapes. The regular octagon decomposes into 4 rectangles and 4 triangles, which are identified and computed.
To calculate the triangle leg length: m = s/√2
To calculate the interior central square: center = s²
In this step the triangles and rectangles are combined into the octagon-shaped region. Feature selection based on the edges of the face image is an interesting research topic, and the outer edges are considered for extraction. The feature set is estimated as in the following equation 3.2:
FOCA = FOCC + FOCT + FOCR   (3.2)
// OCA - Octagon Area //
where Area = Center + Triangle + Rectangle.
The area of one of the four triangles:
Area of triangle = (1/2)·b·h = (1/2)·m·m = (1/2)·m²
The area of one of the four rectangles:
Area of rectangle = s·m
With side length s, the area of the entire regular octagon is generated by summing the smaller areas within the octagon:
FOCA = s² + 4·((1/2)·m²) + 4·(s·m) = s² + 2m² + 4sm = 2s²(1 + √2)
The rectangles and triangles of the octagon are identified and computed as represented in the following equation:
FOCT = {FOCT1 + FOCT2 + FOCT3 + FOCT4}
Finally,
FOCTF = {FOCE, FOCT, FOCR}   (3.3)
The octagon-based features are extracted and tested on images collected from the standard face database ORL DB. Images of each size are considered for this experiment, as shown in Fig.1.4.
[Sample images: 101_1, 102_1, 103_1, 104_1]
Fig.1.4 Octagon features of images extraction
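The area decomposition above can be checked numerically. The sketch below assumes the triangle leg is m = s/√2, which is what the closed form 2s²(1+√2) requires; it is not the authors' code.

```python
# Numerical check of the octagon-area decomposition of equation 3.2:
# a regular octagon with side s is split into a central s-by-s square,
# four corner right triangles with legs m = s/sqrt(2), and four
# s-by-m side rectangles.

import math

def octagon_area_parts(s):
    m = s / math.sqrt(2)           # leg of each corner triangle
    center = s * s                 # central square
    triangles = 4 * (0.5 * m * m)  # four triangles, (1/2) m^2 each
    rectangles = 4 * (s * m)       # four rectangles, s*m each
    return center + triangles + rectangles

def octagon_area_closed(s):
    # the simplified closed form: 2 * (1 + sqrt(2)) * s^2
    return 2 * (1 + math.sqrt(2)) * s * s

# The decomposition and the closed form agree for any side length.
assert math.isclose(octagon_area_parts(3.0), octagon_area_closed(3.0))
```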
extracted with the autocorrelation function [PANN09] and, finally, the octagon-based features of the face images, such as the numbers of triangles and rectangles, are extracted with the octagon algorithm. The three extracted feature sets are concatenated to generate the multiscale features for better face recognition and retrieval, as presented in the following sections.
3.3.2 Auto Correlation Function
In statistics, the autocorrelation function of a random process describes the correlation between the process at different points in time.
3.3.2.1 Feature Extraction with Auto Correlation Coefficient
This section describes the extraction of the feature vectors of the input image with the autocorrelation function. The grayscale input image Ii of size (M×N) from the image database is considered for the extraction of feature vectors. Let the input image Ii be divided into the K sub-images S1, S2, ..., Sk, each of size (n×n), where n <= M, N, using the equation
k = 2^(2(n-1))   (3.4)
where n = 1, 2, 3, ... The feature vectors are then extracted for each of the k non-overlapping sub-regions. The autocorrelation coefficients, which range from 0 to 1, are computed for the (i,j) directions with the positional difference (p,q) of each block using equation 3.5. To generate the feature vectors, some statistical measures are used for the computation.
4. Generation of Feature Set
The facial features play an important role in face image classification, recognition and retrieval systems. This section discusses the extraction of the facial features considered. The features used for the fiducial points are extracted separately and matched.
4.1 Generation of Fiducial Point Feature
The segmented facial image is considered for feature extraction. The features used to establish the fiducial-point-based feature set are the eye, eyebrow, nose and mouth. Facial fiducial point feature set generation (FPF) is considered for image recognition and retrieval. The features are estimated and stored in a feature set as in the following equation 3.7:
FFPF = {FEB, FE, FM, FN}   (3.7)
The feature set of a face image for recognition and retrieval is established as presented below:
FFPF = {12,14,148,170, 12,14,149,149, 12,14,162,199, 12,14,158,121, 12,14,119,29, 12,14,164,168}
The extracted feature set block of the target image is obtained. This procedure is repeated for all the blocks to obtain the feature set for each image in the IDB.
4.2 Generation of Feature Set with Texture
To extract the gray values, rather than binary values, the texture feature is used in the generation of the feature set. With the autocorrelation function, the feature set FTex of the input image is established:
FTex = {F1, F2, ......., Fk}   (3.8)
Hence the obtained octagon feature values of each block are represented, and the feature vectors are obtained with the values mentioned in equation 3.9. The extracted feature set values of the image are obtained as:
FOCTF = {1.000, 0.998, 0.996, 0.992, 0.989, 0.992, 0.995, 0.997, 0.972, 0.976, 0.980, 0.983, 0.958, 0.961, 0.964, 0.967}
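The block-wise autocorrelation feature of Section 3.3.2.1 can be sketched as below. This is an illustrative sketch, not the paper's implementation: the image is cut into non-overlapping n×n sub-images and a normalised autocorrelation coefficient is computed per block at a positional difference (p, q).

```python
# Sketch: divide an image into n x n blocks and compute a normalised
# auto-correlation coefficient per block at shift (p, q).

def split_blocks(img, n):
    """Divide an image (list of rows) into non-overlapping n x n blocks."""
    blocks = []
    for r in range(0, len(img) - n + 1, n):
        for c in range(0, len(img[0]) - n + 1, n):
            blocks.append([row[c:c + n] for row in img[r:r + n]])
    return blocks

def autocorr(block, p, q):
    """Normalised auto-correlation at shift (p, q); lies in [0, 1] for
    non-negative intensities, equalling 1 at zero shift."""
    n = len(block)
    num = sum(block[i][j] * block[i + p][j + q]
              for i in range(n - p) for j in range(n - q))
    den = sum(v * v for row in block for v in row)
    return num / den if den else 0.0

# Toy 4x4 grayscale image; values are invented for illustration.
image = [[12, 14, 12, 14],
         [12, 14, 12, 14],
         [12, 14, 12, 14],
         [12, 14, 12, 14]]
F_tex = [autocorr(b, 1, 0) for b in split_blocks(image, 2)]
```

The list `F_tex` plays the role of the texture feature set FTex of equation 3.8, one coefficient per block.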
Here, the octagon feature set generation is considered for recognition and retrieval. The extracted feature set of each block of the target image is obtained.
4.4 Generation of Feature Set with Proposed Methodology
The earlier sections discussed the generation of the feature set from the various features present in the face image, namely the eye, eyebrow, nose and mouth. Along with these, the texture features present in the input image are used, via the modified autocorrelation function. The representation of the multiscale feature set is presented in equation 3.10:
FMFE = {FFPF, FTex, FOCTF}   (3.10)
where {FFPF} is the fiducial point feature set.
5.1 False Acceptance Rate (FAR)
The false acceptance rate is computed using equation 3.13:
FAR = No. of images accepted out of database / Total no. of images in database   (3.13)
5.2 False Rejection Rate (FRR)
The ratio of the number of correct images rejected in the database to the total number of images in the database. The false rejection rate is computed to evaluate the performance of face image matching using equation 3.14:
FRR = No. of correct images rejected / Total no. of images in database   (3.14)
5.3 Image Retrieval Rate
The performance of image retrieval is measured in terms of precision (P) and recall (R), which are defined below:
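The evaluation measures of equations 3.13 and 3.14 and the retrieval measures of Section 5.3 can be sketched as plain ratios. The counts in the usage comments are invented for illustration, not results from the paper.

```python
# Hedged sketch of the evaluation measures (illustrative only).

def far(accepted_out_of_db, total_in_db):
    """False Acceptance Rate: images wrongly accepted / database size."""
    return accepted_out_of_db / total_in_db

def frr(correct_rejected, total_in_db):
    """False Rejection Rate: correct images rejected / database size."""
    return correct_rejected / total_in_db

def precision(relevant_retrieved, total_retrieved):
    """Fraction of retrieved images that are relevant."""
    return relevant_retrieved / total_retrieved

def recall(relevant_retrieved, total_relevant):
    """Fraction of relevant images that were retrieved."""
    return relevant_retrieved / total_relevant

# e.g. 2 wrong acceptances and 5 wrong rejections over a 100-image DB:
rates = (far(2, 100), frr(5, 100))
```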
// procedure octagon //
Procedure octagon()
{
Step 1: Read the image for the octagon.
Step 2: Initialize the edge orientation using equation 3.1.
Step 3: Bound the closed edges with the rectangle and triangle grid.
Step 4: Eliminate the edges present in a rectangle if its area is less than that of the bounding rectangle, and count the number of rectangles and triangles for feature extraction.
Step 5: Calculate the triangle- and rectangle-based general feature set as discussed in section 3.3.1.
Step 6: Calculate the octagon features of the image as mentioned in equation 3.3.
Step 7: Repeat Step 2 through Step 6 for all the input images and find the feature vectors of each triangle, edge and rectangle.
Step 8: Return.
}
// procedure Euclidean distance //
Procedure Eucli_Dist()
{
Step 1: Compute the distance measures between the images from the IDB and the target image using equation 3.12.
Step 2: Return.
}
[Sample images: 103_2, 104_1]
Fig.1.5 Sample Images considered for experimentation
Each image in the IDB is subjected to the extraction process described above with the proposed model, and the proposed model coefficients are obtained as described in that section. The procedures auto_corr(), octagon_feature() and FPF_feature() are applied, and the entire extraction process uses the values of equation 3.11. The proposed algorithm 6.1 is applied to the images from the selected image database for the experimentation. The image 103_2 is considered as the input image. Initially the fiducial point features are extracted to generate the feature sets. The resulting fiducial point feature set is generated for the input image 103_2, and the feature set for face image recognition and retrieval is established as mentioned in equation 3.7:
FFPF = {FEB, FE, FM, FN}
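Procedure Eucli_Dist() above can be sketched as follows: the Euclidean distance between a target feature vector and each stored feature set in the IDB, with the closest image returned as the match. The vectors and image names below are invented for illustration.

```python
# Sketch of procedure Eucli_Dist(): nearest-neighbour matching of a
# target feature vector against the image database (IDB).

import math

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def best_match(target, idb):
    """idb maps image name -> feature vector; return the nearest image."""
    return min(idb, key=lambda name: euclidean(target, idb[name]))

# Toy database of three images with short feature vectors.
idb = {"101_1": [1.00, 0.99, 0.97],
       "103_1": [0.95, 0.90, 0.88],
       "104_1": [0.50, 0.40, 0.30]}
match = best_match([0.96, 0.91, 0.87], idb)
```

With the toy vectors above, the query lies closest to "103_1", so that image would be retrieved.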
FMFE = {12,14,148,170, 12,14,149,149, 12,14,162,199, 12,14,158,121, 12,14,119,29, 12,14,164,168, 1.000, 0.999, 0.998, 0.996, 0.971, 0.974, 0.977, 0.980, 0.945, 0.946, 0.948, 0.950, 0.923, 0.92, 0.927, 0.929, 1.000, 0.998, 0.996, 0.992, 0.989, 0.992, 0.995, 0.997, 0.972, 0.976, 0.980, 0.983, 0.958, 0.961, 0.964, 0.967}
[Sample images: 101_1, 103_1]
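The multiscale feature set FMFE of equation 3.10 is simply the three feature sets placed together; a minimal sketch (the vectors below reuse truncated samples quoted in the text, not complete feature sets):

```python
# Sketch of F_MFE = {F_FPF, F_Tex, F_OCTF}: concatenation of the three
# feature sets into one multiscale feature vector.

f_fpf = [12, 14, 148, 170, 12, 14, 149, 149]  # fiducial point features
f_tex = [1.000, 0.999, 0.998, 0.996]          # auto-correlation features
f_octf = [1.000, 0.998, 0.996, 0.992]         # octagon features

f_mfe = f_fpf + f_tex + f_octf                # multiscale feature set
```

The concatenated vector `f_mfe` is what would be matched against the database vectors during retrieval.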
Table 1.1 Recognition Rate of Face Images
Model | Acceptance % | Rejection %
GLCM  | 80.50%       | 19.50%
LBP   | 91.90%       | 8.50%
Fig.1.7 Performance evaluation of proposed method
Hence, the retrieval rate is estimated in terms of precision and recall. The precision and recall of the proposed method are presented in the following table, together with the precision and recall obtained for the existing schemes; for comparison, both are incorporated in the same Table 1.2.
Table 1.2 Comparison Result of Proposed Method with Existing Methods
          | Proposed Hybrid Model | Existing Method
Recall    | 1.25                  | 1.10
Precision | 0.5                   | 0.3
Table 1.2 shows the precision and recall of the proposed model: the proposed model provides a recall of 1.25 and a precision of 0.5, while the AC model provides 1.10 and 0.3 respectively. Hence, the proposed model is also efficient for image retrieval. A pictorial representation of the retrieval performance is shown in the chart above.
9. REFERENCES
[1] Xiao-rong Pu, Yi Zhou, and Rui-yi Zhou, "Face recognition on partial and holistic LBP features", Journal of Electronic Science and Technology, vol. 10, no. 1, 2012.
[2] T. Ojala, M. Pietikainen, and D. Harwood, "A comparative study of texture measures with classification based on featured distribution", Pattern Recognition, vol. 29, no. 1, pp. 51-59, 1996.
[3] M. Karg, R. Jenke, W. Seiberl, K. K. A. Schwirtz and M. Buss, "A Comparison of PCA, KPCA and LDA for Feature Extraction to Recognize Affect in Gait Kinematics", 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops, pp. 1-6, 2009.
[4] O. Toygar and A. Acan, "Face Recognition Using PCA, LDA and ICA Approaches on Colored Images", Journal of Electrical & Electronic Engineering, vol. 3, pp. 735-743, 2003.
[5] F. Song, Z. Guo, and D. Mei, "Feature Selection Using Principal Component Analysis", International Conference on System Science, Engineering Design and Manufacturing Informatization, vol. 1, pp. 27-30, 2010.
ABSTRACT
An air pollutant is a substance in the air that can have adverse effects on humans. A pollutant can be natural or man-made. Harmful emissions in the air can severely affect human health, so forecasting models are essential for predicting air quality. Air quality and pollution monitoring services are provided by many countries. In this paper we describe our experience in developing a personal air pollution exposure estimation system using particle swarm optimization. SO2 emissions have been an international pollution factor because coal and petroleum often contain sulfur compounds. Here, Particle Swarm Optimization (PSO) is used to analyze world SO2 emissions based on global energy consumption. The experimental results show that PSO can provide good modeling results compared to other linear models.
General Terms
Sulphur Dioxide (SO2), Particle Swarm Optimization (PSO)
Keywords
Air Pollution, Right Fit
1. INTRODUCTION
Air pollution is caused by materials present in the air, such as biological molecules or other harmful materials, possibly causing disease or death in humans and damage to other living organisms. According to the WHO report, air pollution causes the death of millions of people worldwide [1]. Sulfur oxides (SOx), particularly sulfur dioxide (SO2), are produced by various industries. Coal and petroleum combustion generates sulfur dioxide (SO2). Further oxidation of SO2, usually in the presence of a catalyst such as NO2, forms H2SO4, and thus acid rain [2]. This is one of the causes for concern over the environmental impact of the use of these fuels as power sources.
Sulfur dioxide (SO2) is a gas primarily emitted by fossil fuel combustion at power plants and other industries. Fuel combustion in mobile sources such as locomotives, ships and other equipment also produces SO2. SO2 exposure has adverse impacts on the respiratory system. Recent reviews of the EPA standard have determined that even short-term exposure to high levels of SO2 can have a detrimental effect on breathing function, particularly for those who suffer from asthma. SO2 also reacts with other chemicals in the air to form tiny sulfate particles, contributing to levels of PM 2.5, and to form acids, which fall to the earth as acid rain. Acid rain damages forests and crops.
1.1 Sources of Air Pollution
The following factors are responsible for releasing pollutants into the atmosphere. They are classified into two main categories:
Anthropogenic (man-made) sources. Burning of fuel results in the following:
- Stationary sources, including the smoke of power plants, manufacturing units and waste incinerators (units where waste is burned).
- Mobile sources, including road transport vehicles, marine vessels and aircraft engines.
- Fumes from paint, spray, varnish, aerosol and other solvents.
- Other sources, such as nuclear weapons and toxic gases.
- Point sources: large, stationary sources with relatively high emissions, such as electric power plants and refineries.
- Nonpoint sources: smaller stationary sources such as dry cleaners, gasoline service stations and residential wood burning; may also include diffuse stationary sources such as wildfires and agricultural tilling.
- On-road vehicles: vehicles operated on highways, streets and roads.
- Non-road sources: off-road vehicles and portable equipment powered by internal combustion engines, including lawn and garden equipment, recreational equipment, construction equipment, aircraft and locomotives.
Natural sources:
- Dust from natural sources, usually large areas of land with little vegetation.
- Methane gas, emitted by animals during digestion.
Here the term w is the inertia weight, used to balance the local and global search abilities of the algorithm by controlling the influence of the previous velocity information. The stopping criteria are:
- Maximum number of iterations (1000)
- Validation or generalization error greater than 5%
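The role of the inertia weight w can be illustrated with the standard PSO velocity update for a single particle in one dimension. This is a generic PSO sketch, not the paper's implementation; the w, c1 and c2 values are typical choices, not taken from the paper.

```python
# Standard inertia-weighted PSO velocity update (one particle, 1-D).

import random

def pso_update(x, v, pbest, gbest, w=0.7, c1=1.5, c2=1.5):
    """One PSO step: w scales the previous velocity (balancing local vs.
    global search); c1 and c2 weight the pulls toward the particle's own
    best position (pbest) and the swarm's best position (gbest)."""
    r1, r2 = random.random(), random.random()
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    return x + v, v

# One step from x = 2.0 toward bests at 1.0 and 0.0.
x, v = pso_update(x=2.0, v=0.0, pbest=1.0, gbest=0.0)
```

With both best positions below the current position, the update can only pull the particle downward (or leave it in place when the random factors are zero).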
3.1 Nature of Data
The Particle Swarm Optimization (PSO) model was used to forecast the SO2 emission, and the related data were used for fitting the models. The values of four energy commodities, namely global oil, natural gas, coal and primary energy, are considered.
3.2 Evaluation Criterion
To assess the performance of the PSO model, the Root Mean Square error (RMS), the Manhattan distance (MD), the Euclidian distance (ED), and the Variance-Accounted-For (VAF) were measured. These criteria help in evaluating the capabilities of the developed PSO model. The RMS, VAF, ED and MD are computed as follows:
RMS = sqrt( (1/n) Σ_{i=1..n} (y_i - ŷ_i)² )   (3)
(3) Euclidian distance (ED):
The search process begins with a random distribution of the initial PSO population, which represents a random sample of the PSO search space. The corresponding Right fit of each particle is computed, and particles with higher Right fit are selected to produce a new population which enhances the features of their parents. The problem under consideration is to estimate the correct parameters α1, β1, α2, β2, α3, β3, α4 and β4 of the proposed exponential model. A data set adopted from [5] was used for the model development process. We defined the parameter space, and the other PSO setup tuning parameters are given in Table 2. The Right fit (i.e. quality) of a particular estimate is obtained by observing the behavior of the system with the estimated parameters, using the RMS error between the actual and predicted SO2:
SO2 = α1·Oil^β1 + α2·NG^β2 + α3·Coal^β3 + α4·PE^β4 + ε   (7)
[Table 2. The tuning parameters for the PSO — Inputs: Oil; NG; Coal; PE]
trust layer to the ontology. It is also a mechanism to find the information source; signature and cryptography techniques are used for security purposes. Trust is classified into four types, and we organize trust research into four major areas:
Policy-based trust: established using various policies focused on managing and exchanging credentials and enforcing access policies. It generally requires a sufficient amount of credentials pertaining to a specific party before granting certain authentication rights to that party. The various issues here are solved using a third party that serves as an authority for issuing and verifying credentials.
Reputation-based trust: established from the earlier interactions and performance of an entity, to provide a view of its future behavior. It uses the history of an entity's actions/behavior to compute trust, and may use referral-based trust in the absence of first-hand knowledge. Recommendations are trust decisions made by other users, and combining these decisions to synthesize a new, often personalized, one is another commonly addressed problem.
General models of trust: models useful for analyzing human and agent trust decisions and for operationalizing computable models of trust. They also describe the values or factors that play a role in computing trust, and lean on work in psychology and sociology for a decomposition of what trust comprises, ranging from simple access control policies to analyses of competence, beliefs, risk, importance, utility, etc.
Trust in information resources: trust in information resources on the Web has its own range of varying uses and meanings, including capturing ratings from users about the quality of information and services. Web site design influences trust in content and content providers, propagating trust over links. New work in trust harnesses the potential gained from machine understanding and addresses the problems of reliance on the content available on the web. This information is key to supporting automated trust decision making.
Concerning trust on the Semantic Web, Bizer and Oldakowski [5] make several claims. First, any statement contained in the Semantic Web must be considered a claim rather than a fact until trust can be established. Second, their work makes the case that it is too much of a burden for providers to keep trust information current. Third, context-based trust matters; here, context refers to the circumstances and associations of the target of the trust decision, and information is filtered based on trust.
3.3 Semantic Web in Life Science
The Semantic Web plays a vital role in a broad category of the goals of bio-informaticians. In this area huge quantities of data, with their annotations, are present on the web; they are highly distributed and heterogeneous, and this heterogeneity exists at many levels. Semantic Web technologies and their vision provide the solution by integrating the views of bioinformatics. At one level of heterogeneity, programmatic access to bioinformatics resources is offered as web services which, with Semantic Web technologies, give common access to distributed resources on heterogeneous platforms. The main objective of the Semantic Web is to make facts amenable to machine processing. The Resource Description Framework (RDF) provides a common data model of triples for this processing. An RDF triple consists of a subject, a predicate (verb) and an object, which enables any statement to be represented in a simple and flexible common framework. The triple names a resource using either a Uniform Resource Identifier (URI) or a literal. To transform the resources into the data model, a common naming scheme is used, forming a vast graph of descriptions of resources [4]. The Life Science Identifier (LSID) is a form of URI that can be used to uniquely identify and version bioinformatics entries.
3.4 Semantic Web in Bioinformatics
The volume of bioinformatics-related data available on the Web has increased, and accessing these data has become a tedious task for researchers and scientists: because of the heterogeneous computing environment, users have to swap between various websites for this information access. The Oracle RDF Data Model has generated much attention. This stems from the desire to be able to manage RDF data in a secure, scalable, and highly available environment. Users have the flexibility of being able to incorporate multiple media data, such as images and text, into the RDF graphs. Additional benefits include simplified integration of RDF data with other enterprise data, re-use of RDF objects, elimination of the modeling impedance mismatch between client RDF objects and relational storage, and easier maintenance of RDF applications.
The deployed RDF infrastructure was able to easily and quickly retrieve biological data related to the probe set biomarkers. It also became possible for users to select or search for the bioinformatics information of interest, rather than manually collecting identifiers to enable jumping between data sources. The faceted browsing methods used remove all data which do not meet the filter conditions, allow multiple entities to be selected for simultaneous search, and make it possible to identify clustering of genes within genes.
3.5 Semantic Web in Biometrics
Biometrics plays the predominant role in all fields of human identification systems (HIS). It also inherits commonly co-occurring semantic labels. This label structure arises from genetic, morphological and social factors and can be exploited to improve semantic biometrics. This method considers concealing a person's physical attributes from the camera using various sets of features. These features can affect the automatic semantic annotation of the biometric data, leading to incorrect semantic labels. By utilizing the semantic structure we can compensate for missing visual features and correct erroneous semantic labels. The semantic structure can
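The RDF triple model described in Section 3.3 can be illustrated with a minimal in-memory graph: each statement is a (subject, predicate, object) tuple naming resources, and querying is pattern matching over the set of triples. The URIs and facts below are invented examples, not data from the paper.

```python
# Minimal illustration of the RDF triple model: a graph is a set of
# (subject, predicate, object) statements, matched by pattern.

triples = {
    ("urn:lsid:example:gene:BRCA1", "ex:locatedOn", "ex:Chromosome17"),
    ("urn:lsid:example:gene:BRCA1", "ex:associatedWith", "ex:BreastCancer"),
    ("ex:Chromosome17", "ex:partOf", "ex:HumanGenome"),
}

def match(graph, s=None, p=None, o=None):
    """Return triples matching the given pattern; None is a wildcard."""
    return [(a, b, c) for (a, b, c) in graph
            if (s is None or a == s)
            and (p is None or b == p)
            and (o is None or c == o)]

# All statements about one resource, found without knowing which
# predicates exist -- the flexibility the triple model provides.
brca1_facts = match(triples, s="urn:lsid:example:gene:BRCA1")
```

A production system would use an RDF store and a query language such as SPARQL, but the underlying data model is exactly this set-of-triples graph.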
D. Rajakumari
Assistant Professor,
Dept of Comp. Sci
Nandha Arts & Science
College, Erode, India
rsrajakumarid@gmail.com
important method for classifying objects based on the closest training data in the feature space. It is the simplest of all machine learning algorithms, but the accuracy of the k-NN algorithm can be degraded by the presence of noisy features. This observation was made using a training set of 3000 instances with 14 different attributes. The dataset is divided into training and testing sets: 70% of the data is used for training and 30% for testing. The authors concluded that the KM-SVM algorithm performs well when compared to other algorithms.

…types of data will sense a d-point data vector. Let x~0_m, x~1_m, x~2_m, ..., x~k_m denote the d-point data vectors at Si, Si1, Si2, ..., Sik in the set N(Si) at the mth time instant, respectively. The goal is to identify every new measurement arriving at Si as normal or anomalous in real time. KM-SVM determines the radius of the quarter-sphere based on attribute correlations in addition to the spatial-temporal correlations [6].

Fig 1: Region R for KM-SVM
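The k-NN idea described above (classify an object by majority vote among its closest training points in the feature space) can be sketched as follows. This is a minimal stdlib-only illustration; the points and labels are invented for the example, not the 3000-instance, 14-attribute dataset mentioned in the text.

```python
from collections import Counter
import math

def knn_classify(train, labels, query, k=3):
    """Classify `query` by majority vote among the k nearest training points."""
    # Euclidean distance from the query to every training vector
    dists = sorted((math.dist(x, query), y) for x, y in zip(train, labels))
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

# Toy data: two well-separated clusters, labelled "normal" and "anomalous"
train = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0), (5.2, 4.8)]
labels = ["normal", "normal", "normal", "anomalous", "anomalous"]

print(knn_classify(train, labels, (1.1, 1.0)))  # -> normal
print(knn_classify(train, labels, (5.1, 5.1)))  # -> anomalous
```

As the text notes, a noisy feature would distort these distances and can flip the vote, which is why k-NN accuracy degrades in the presence of noisy attributes.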
…regression estimation have been proposed [2], but accuracy is not the best measure for detection. Here we compare our research with relevant literature related to the presented work. Relevant papers are grouped into categories according to the specific tasks they are intended to solve, such as outlier detection [14], outlying property discovery (OPD), subgroup discovery, emerging patterns (EPs), contrast sets, rule-based classifiers and association rule mining.

3.2 Existing System
In the existing system, the detection of patterns and outliers has emerged as an important area of work in the field of data mining. Outliers are abnormal objects that deviate from normal objects. Finding outliers helps in several applications, such as credit card fraud detection and identifying network intrusions. The system provides the idea of detecting outliers based on their difference in object distance. In the classification task, detecting outliers in the data set yields less accurate results. Outlier mining identifies records that differ from the rest of the data set. However, the existing work does not focus much on finding critical nuggets of information in the data sets: such nuggets are not detected by outlier detection.

Step 2: The K-Means clustering algorithm is run on the original data, and all cluster centers are regarded as the compressed data for building classifiers.
Step 3: SVM classifiers are built on the compressed data.
Step 4: Three input parameters are adjusted by the heuristic searching strategy proposed in this paper according to a tradeoff between the testing accuracy and the response time.
Step 5: Return to Step 1 to test the new combination of input parameters, and stop if the combination is acceptable according to testing accuracy and response time.

Advantages of the proposed system:
This algorithm works in two phases, each phase working on one of the two classes. The steps of critical nugget identification are:
- Identify the approximate boundary.
- Consider the neighbourhood data around the boundary set; based on the KM-SVM value, critical nuggets are identified.
- The classification accuracy is improved.
- Class imbalance is overcome by fine tuning.
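Steps 2 and 3 above (compress the training data with K-Means, then build a classifier on the cluster centers) can be sketched as follows. This is a dependency-free toy: the paper trains SVM classifiers on the compressed data, but here a nearest-labelled-center rule stands in for the SVM so the sketch runs with the standard library only. The data, labels and k value are invented for illustration.

```python
import math

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm: returns k cluster centers."""
    centers = list(points[:k])  # simple deterministic initialisation
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            buckets[i].append(p)
        centers = [tuple(sum(col) / len(b) for col in zip(*b)) if b else centers[i]
                   for i, b in enumerate(buckets)]
    return centers

def compress(points, labels, k):
    """Step 2: run K-Means and keep one labelled center per cluster."""
    centers = kmeans(points, k)
    compressed = []
    for c in centers:
        # label each center by the label of its nearest original point
        nearest = min(range(len(points)), key=lambda i: math.dist(points[i], c))
        compressed.append((c, labels[nearest]))
    return compressed

# Step 3 would train an SVM on `compressed`; as a stand-in we
# classify a query by its nearest labelled center.
def classify(compressed, query):
    return min(compressed, key=lambda cl: math.dist(cl[0], query))[1]

data = [(0.1, 0.2), (0.2, 0.1), (0.0, 0.3), (4.0, 4.1), (4.2, 3.9), (3.9, 4.0)]
labels = ["normal"] * 3 + ["anomalous"] * 3
model = compress(data, labels, k=2)
print(classify(model, (0.15, 0.15)))  # -> normal
```

The point of the compression step is response time: the downstream classifier is trained on k centers instead of the full dataset, which is the accuracy/response-time tradeoff that Steps 4 and 5 tune.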
…results for a merge between immune computing and feature reduction. Immune computing is one of the newest directions in bio-inspired machine learning and has had very fruitful successes in different areas. The classification selection theory is one of the first theories applied in KM-SVM, and in this paper it is supported by the feature reduction technique, Support Vector Machine, as a first step before the start of the immune defense. The proposed KM-SVM achieved good, improved results on the Breast Cancer datasets.

6. REFERENCES
[1] A. Koufakou and M. Georgiopoulos, "A Fast Outlier Detection Strategy for Distributed High-Dimensional Data Sets with Mixed Attributes," Data Mining and Knowledge Discovery, vol. 20, no. 2, special issue SI, pp. 259-289, Mar. 2010.
[2] R.A. Weekly, R.K. Goodrich, and L.B. Cornman, "An Algorithm for Classification and Outlier Detection of Time-Series Data," J. Atmospheric and Oceanic Technology, vol. 27, no. 1, pp. 94-107, Jan. 2010.
[3] M. Ye, X. Li, and M.E. Orlowska, "Projected Outlier Detection in High-Dimensional Mixed-Attributes Data Set," Expert Systems with Applications, vol. 36, no. 3, pp. 7104-7113, Apr. 2009.
[4] K. McGarry, "A Survey of Interestingness Measures for Knowledge Discovery," Knowledge Eng. Rev., vol. 20, no. 1, pp. 39-61, 2005.
[5] L. Geng and H.J. Hamilton, "Interestingness Measures for Data Mining: A Survey," ACM Computing Surveys, vol. 38, article 9, http://doi.acm.org/10.1145/1132960.1132963, Sept. 2006.
[6] E. Triantaphyllou, Data Mining and Knowledge Discovery via Logic-Based Methods. Springer, 2010.
[7] E.M. Knorr, R.T. Ng, and V. Tucakov, "Distance-Based Outliers: Algorithms and Applications," VLDB J., vol. 8, no. 3/4, pp. 237-253, 2000.
[8] D. Hawkins, Identification of Outliers (Monographs on Statistics and Applied Probability). Springer, http://www.worldcat.org/isbn/041221900X, 1980.
[9] F. Angiulli and C. Pizzuti, "Outlier Mining in Large High-Dimensional Data Sets," IEEE Trans. Knowledge and Data Eng., vol. 17, no. 2, pp. 203-215, Feb. 2005.
[10] Y. Tao, X. Xiao, and S. Zhou, "Mining Distance-Based Outliers from Large Databases in Any Metric Space," Proc. 12th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), T. Eliassi-Rad, L.H. Ungar, M. Craven, and D. Gunopulos, eds., pp. 394-403, 2006.
[11] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly Detection: A Survey," ACM Computing Surveys, vol. 41, no. 3, article 15, 2009.
[12] S.D. Bay and M. Schwabacher, "Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule," Proc. Ninth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), L. Getoor, T.E. Senator, P. Domingos, and C. Faloutsos, eds., pp. 29-38, 2003.
[13] A. Ghoting, S. Parthasarathy, and M.E. Otey, "Fast Mining of Distance-Based Outliers in High-Dimensional Datasets," Data Mining and Knowledge Discovery, vol. 16, no. 3, pp. 349-364, 2008.
[14] M.M. Breunig, H.-P. Kriegel, R.T. Ng, and J. Sander, "LOF: Identifying Density-Based Local Outliers," SIGMOD Record, vol. 29, no. 2, pp. 93-104, 2000.
[15] L. Duan, L. Xu, Y. Liu, and J. Lee, "Cluster-Based Outlier Detection," Annals of Operations Research, vol. 168, no. 1, pp. 151-168, http://dx.doi.org/10.1007/s10479-008-0371-9, Apr. 2009.
[16] N. Panda, E.Y. Chang, and G. Wu, "Concept Boundary Detection for Speeding Up SVMs," Proc. 23rd Int'l Conf. Machine Learning (ICML), W.W. Cohen and A. Moore, eds., vol. 148, pp. 681-688, 2006.
[17] P. Domingos, "MetaCost: A General Method for Making Classifiers Cost-Sensitive," Proc. Fifth Int'l Conf. Knowledge Discovery and Data Mining, pp. 155-164, 1999.
[18] A. Frank and A. Asuncion, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml, 2010.
[19] D. Sathiaraj and E. Triantaphyllou, "On Identifying Critical Nuggets of Information During Classification Tasks," IEEE Trans. Knowledge and Data Engineering, vol. 25, no. 6, June 2013.
ABSTRACT
In the innovative development of information technology, it has joined hands with various areas, especially bioinformatics. Due to this emergence, DNA plays a vital role in classifying human sequences. Single Nucleotide Polymorphism (SNP) is the simplest form of DNA variation among individuals. All human beings differ from one another in physical appearance, susceptibility to disease and response to drugs. These differences arise from a single nucleotide change in the human DNA sequence. The analysis of SNPs identifies the disease susceptibility in a gene using the gene mapping technique. SNPs are mainly applied for personal medication. In this paper, we describe the process of discovering SNPs and its processing tools.

Keywords
Single nucleotide polymorphism (SNP), Disease susceptibility, DNA, Personal medication, Gene Mapping.

1. INTRODUCTION
A Single Nucleotide Polymorphism (SNP) is a DNA sequence variation occurring when a single nucleotide A, T, C, G in the genome differs between members of a species. Generally, 99.9% of human DNA sequence A, G, C and T bases are identical; the remaining 0.1% makes a person unique. Sequences are selected from the same region of the genome of two different people, and the two sequences are compared to find the variations that may be SNPs or insertions/deletions (indels). The SNP and indel are represented in the following figure.

X: ACGGTAGTA
Y: ACGATAGTAA
(the fourth base differs: SNP; the extra trailing base in Y: INDEL)

Fig. 1 SNP Variation

In Figure 1, X and Y indicate the DNA sequences of two humans. The DNA sequences of X and Y are almost identical; only two nucleotides differ. Each such difference is called an SNP or an indel (insertion/deletion). The variations may be harmful or harmless: a harmful variation causes some disease, while a harmless one changes only the phenotype (appearance, characteristics). SNPs are the most abundant form of genetic variation and are the basis for most molecular markers. SNPs can occur in both coding (gene) and non-coding regions of the genome [1]. A coding-region SNP changes the amino acid sequence, while a non-coding SNP changes the level of gene expression. SNPs occur throughout the human genome, about one in every 300 nucleotide base pairs, which translates to about 10 million SNPs within the 3-billion-nucleotide human genome [2]. SNPs can be discovered through shotgun sequencing and resequencing of targeted genomic regions [3]. A set of SNPs in the genome is called a haplotype block, and haplotype analysis is used to find the locations of disease-affected genes. In this paper we present an overview of the methodology for mining SNPs from DNA sequence data and an analysis of SNP discovery software tools. Gene mapping is the process of establishing the locations of genes on the chromosomes. Early gene maps used linkage analysis; more recently, scientists have used recombinant DNA (rDNA) techniques to establish the actual physical locations of genes on the chromosomes.

2. SNP DISCOVERY PROCESS
Mining SNPs from DNA sequence data generally follows these steps:
1. Read the sequence data.
2. Group the sequence reads by clustering or mapping.
3. Align the sequence reads.
4. Identify the sequence variants as potential polymorphisms.
The process is shown in Figure 2.

3. SEQUENCE DATA
The sequence data can be classified into two types:
1. Reference sequence data
2. De novo sequence data.

3.1 Reference Sequence Data
Reference sequence data are known fragments that already have a sequence against which new sequence data can be compared. Examples of reference sequence data are the human genome and many microbial species. In recent years the number of sequenced species has been increasing rapidly, so Next Generation Sequencing (NGS) technologies are used; such high-throughput (HT) sequencing will likely bring many SNPs to the attention of researchers [4]. The NGS technologies include: …
Fig. 2 SNP discovery process: reference sequence data (known sequences) are mapped (BLAST), while de novo sequence data (unknown sequences) are clustered (d2_cluster); the reads are then aligned and assembled, and sequence variants are identified as SNPs.
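The variant-identification step of the process above can be illustrated with a small Python sketch. This is only a toy comparison of two already-aligned sequences, as in Fig. 1; real pipelines do the grouping and alignment with tools such as BLAST, d2_cluster and CAP3. The function name and output format are invented for the example.

```python
def find_snps(seq_x, seq_y):
    """Compare two aligned DNA sequences position by position.
    A mismatch at an aligned position is a candidate SNP; extra
    trailing bases are reported as a candidate indel."""
    n = min(len(seq_x), len(seq_y))
    snps = [(i, seq_x[i], seq_y[i]) for i in range(n) if seq_x[i] != seq_y[i]]
    indel = seq_x[n:] or seq_y[n:]  # leftover bases, if any
    return snps, indel

# The two sequences from Fig. 1
snps, indel = find_snps("ACGGTAGTA", "ACGATAGTAA")
print(snps)   # -> [(3, 'G', 'A')]  candidate SNP at position 3
print(indel)  # -> 'A'              trailing base: candidate indel
```

A real SNP caller would additionally weigh base-calling quality scores before accepting a mismatch as a true polymorphism, which is exactly what the tools surveyed below automate.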
…includes a database to keep track of the status of sequence variants and annotation.

Zhang et al. [17] developed a tool for sensitive and accurate detection of SNPs called SNPdetector. It is aimed at automating the manual review step in the SNP mining procedure, as an alternative to PolyPhred. It starts from base-calling quality scores (Phred) on specific amplicon fragments. These are aligned using SIM [18], allowing for substantial sequence variation. Next, polymorphisms are identified in regions of high base quality (neighborhood quality standard). Finally, heterozygote genotypes are identified and SNPs evaluated by comparing forward and reverse sequencing data. SNPdetector was tested on mouse and human datasets. It reduces the false positive and false negative rates compared to PolyPhred and novoSNP.

Manaster et al. [19] introduced a tool called InSNP for automated detection and visualization of SNPs and indels. It is similar to PolyPhred and novoSNP. This software was targeted at mutation detection, and its performance was compared to PolyPhred and Mutation Surveyor. It reduced the false positive and false negative rates compared to PolyPhred.

Dereeper et al. [20] introduced a web-based tool called SNiPlay for detecting SNPs and indels in DNA sequences. From standard sequence alignments, genotyping data or Sanger sequencing traces given as input, SNiPlay detects SNP and indel events and outputs submission files for the design of Illumina's SNP chips. Subsequently, it sends sequences and genotyping data into a series of modules in charge of various processes: physical mapping to a reference genome, annotation, SNP frequency determination in user-defined groups, haplotype reconstruction and network, linkage disequilibrium evaluation, and diversity analysis. When input is provided as standard FASTA alignments, SNiPlay uses a home-made Perl module to detect SNPs and insertion/deletion events and to extract allelic information for each polymorphic position. A database stores polymorphisms, genotyping data and grapevine sequences released by public and private projects. It allows the user to retrieve SNPs using various filters, to compare SNP patterns between populations, and to export genotyping data or sequences in various formats.

Batley et al. [21] presented a tool called SNPServer, a real-time flexible tool for the discovery of SNPs within DNA sequence data. The program uses BLAST to identify related sequences and CAP3 to cluster and align these sequences. The alignments are parsed to the SNP discovery software autoSNP, a program that detects SNPs and insertion/deletion polymorphisms (indels) in maize expressed sequence tag data. SNP discovery is performed using a redundancy-based approach.

Tang et al. [22] developed a tool called HaploSNPer, a flexible web-based tool for detecting SNPs and alleles in user-specified input sequences from both diploid and polyploid species. It includes BLAST for finding homologous sequences in public EST databases, CAP3 or PHRAP for aligning them, and QualitySNP for discovering reliable allelic sequences and SNPs. All possible and reliable alleles are detected by a mathematical algorithm using potential SNP information. Reliable SNPs are then identified based on the reconstructed alleles and on sequence redundancy. It can efficiently detect reliable SNPs, reconstruct haplotypes and therefore identify different alleles using only EST sequence information. Furthermore, HaploSNPer supplies a user-friendly interface for visualization of SNPs and alleles, which supports the selection of informative SNP and allele-specific markers.

Amigo et al. [23] used a tool called ENGINES (ENtire Genome INterface for Exploring SNVs) to retrieve single nucleotide variation from 1000 Genomes data. The whole dataset is pre-processed and summarized into a data mart accessible through a web interface. The query system allows the combination and comparison of each available population sample while searching by rs-number list, chromosome region, or genes of interest. Frequency and FST filters are available to further refine queries, while results can be visually compared with other large-scale Single Nucleotide Polymorphism (SNP) repositories such as HapMap or Perlegen.

5. CONCLUSION
In this paper we have presented a detailed study of SNP tools. SNP analyses are mainly used for personalized medication. From this analysis of SNP tools, we find the ENGINES tool to be efficient compared with previous tools such as PolyPhred, HaploSNPer and novoSNP. SNPs are mainly used to find the disease susceptibility gene location using the gene mapping technique.

6. REFERENCES
[1] Garg, K., Green, P., and Nickerson, D. A. 1999. Identification of candidate coding region single nucleotide polymorphisms in 165 human genes using assembled expressed sequence tags. Genome Research, 9(11), 1087-1092.
[2] http://learn.genetics.utah.edu/content/pharma/snips/
[3] Salisbury, B. A., Pungliya, M., Choi, J. Y., Jiang, R., Sun, X. J., and Stephens, J. C. 2003. SNP and haplotype variation in the human genome. Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis, 526(1), 53-61.
[4] Van Oeveren, J., and Janssen, A. 2009. Mining SNPs from DNA sequence data: computational approaches to SNP discovery and analysis. In Single Nucleotide Polymorphisms (pp. 73-91). Humana Press.
[5] Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. 1990. Basic local alignment search tool. Journal of Molecular Biology, 215(3), 403-410.
[6] Ning, Z., Cox, A. J., and Mullikin, J. C. 2001. SSAHA: a fast search method for large DNA databases. Genome Research, 11(10), 1725-1729.
[7] Smit, A. F., Hubley, R., and Green, P. 1996. RepeatMasker Open-3.0.
[8] Green, S. J., Monreal, R. P., White, A. T., Bayer, T. G., Arquiza, Y. D., Buenaflor Jr, R., and Arquiza, N. Y. 1999. PHRAP documentation.
[9] Huang, X., and Madan, A. 1999. CAP3: A DNA sequence assembly program. Genome Research, 9(9), 868-877.
[10] Burke, J., Davison, D., and Hide, W. 1999. d2_cluster: a validated method for clustering EST and full-length cDNA sequences. Genome Research, 9(11), 1135-1142.
[11] Pertea, G., Huang, X., Liang, F., Antonescu, V., Sultana, R., Karamycheva, S., and Quackenbush, J. 2003. TIGR Gene Indices clustering tools (TGICL): a software system for fast clustering of large EST datasets. Bioinformatics, 19(5), 651-652.
[12] Nickerson, D. A., Tobe, V. O., and Taylor, S. L. 1997. PolyPhred: automating the detection and genotyping of single nucleotide substitutions using fluorescence-based resequencing. Nucleic Acids Research, 25(14), 2745-2751.
[13] Stephens, M., Sloan, J. S., Robertson, P. D., Scheet, P., and Nickerson, D. A. 2006. Automating sequence-based detection and genotyping of SNPs from diploid samples. Nature Genetics, 38(3), 375-381.
[14] Marth, G. T., Korf, I., Yandell, M. D., Yeh, R. T., Gu, Z., Zakeri, H., and Gish, W. R. 1999. A general approach to single-nucleotide polymorphism discovery. Nature Genetics, 23(4), 452-456.
[15] Wang, J., and Huang, X. 2005. A method for finding single-nucleotide polymorphisms with allele frequencies in sequences of deep coverage. BMC Bioinformatics, 6(1), 220.
[16] Weckx, S., Del-Favero, J., Rademakers, R., Claes, L., Cruts, M., De Jonghe, P., and De Rijk, P. 2005. novoSNP, a novel computational tool for sequence variation discovery. Genome Research, 15(3), 436-442.
[17] Zhang, J., Wheeler, D. A., Yakub, I., Wei, S., Sood, R., Rowe, W., and Buetow, K. H. 2005. SNPdetector: a software tool for sensitive and accurate SNP detection. PLoS Computational Biology, 1(5), e53.
[18] Huang, X., Hardison, R. C., and Miller, W. 1990. A space-efficient algorithm for local similarities. Computer Applications in the Biosciences: CABIOS, 6(4), 373-381.
[19] Manaster, C., Zheng, W., Teuber, M., Wächter, S., Döring, F., Schreiber, S., and Hampe, J. 2005. InSNP: a tool for automated detection and visualization of SNPs and InDels. Human Mutation, 26(1), 11-19.
[20] Dereeper, A., Nicolas, S., Le Cunff, L., Bacilieri, R., Doligez, A., Peros, J. P., and This, P. 2011. SNiPlay: a web-based tool for detection, management and analysis of SNPs. Application to grapevine diversity projects. BMC Bioinformatics, 12(1), 134.
[21] Savage, D., Batley, J., Erwin, T., Logan, E., Love, C. G., Lim, G. A., and Edwards, D. 2005. SNPServer: a real-time SNP discovery tool. Nucleic Acids Research, 33(suppl 2), W493-W495.
[22] Tang, J., Leunissen, J. A., Voorrips, R. E., van der Linden, C. G., and Vosman, B. 2008. HaploSNPer: a web-based allele and SNP detection tool. BMC Genetics, 9(1), 23.
[23] Amigo, J., Salas, A., and Phillips, C. 2011. ENGINES: exploring single nucleotide variation in entire human genomes. BMC Bioinformatics, 12(1), 105.
…known order-statistics filter is the median filter, which replaces the value of a pixel by the median of the gray levels in the neighborhood of that pixel.

The median filter is an effective method that can suppress isolated noise without blurring sharp edges. In median filtering, all the pixel values in the neighborhood are first sorted into numerical order, and the pixel is then replaced with the middle (median) value [8]. Let y represent a pixel location and w represent a neighborhood centered around location (m, n) in the image; then the working of the median filter is given by

y[m, n] = median{ x[i, j] : (i, j) in w }    ...(1)

Here y[m, n] is the output pixel at coordinates (m, n), w is the set of neighborhood pixel positions surrounding (m, n), and the median is taken over all pixels x[i, j] with (i, j) in that neighborhood. The original value of the pixel is included in the computation of the median. Median filters are quite popular because, for certain types of random noise, they provide excellent noise reduction with considerably less blurring than linear smoothing filters of similar size. Fig 1.2 illustrates an example of how the median filter is calculated.

Fig. 1.2 Median Filter

5.2 Thresholding
The task of thresholding is to extract the foreground from the background. A number of thresholding techniques have previously been proposed using global and local approaches. Global methods apply one threshold to the entire image, while local thresholding methods apply different threshold values to different regions of the image [Leedham]. The histogram of gray-scale values of a document image typically consists of two peaks: a high peak corresponding to the white background and a smaller peak corresponding to the foreground. So the task of determining the threshold gray-scale value is one of finding an optimal value in the valley between the two peaks. Here, Otsu's histogram-based global thresholding algorithm is used.

5.3 Skew Detection
There are several commonly used methods for detecting skew in a page; some rely on detecting connected components (for many purposes, they are roughly equivalent to characters) and finding the average angles connecting their centroids. The method we employed (after observing it in Fateman's program) was to project the page at several angles and determine the variance in the number of black pixels per projected line. The projection parallel to the true alignment of the lines will likely have the maximum variance, since, when parallel, each given ray projected through the image will hit either almost no black pixels (as it passes between text lines) or many black pixels (while passing through many characters in sequence).

5.4 Skew Correction
After the skew angle of the page has been detected, our recognition algorithm demands that the page be rotated to correct for this skew. Our rotation algorithm had to be both fairly fast and fairly accurate. It was a pure coordinate transformation, which takes a little time on large images but gets the rotation exact.

5.5 Algorithm
Input: original image
Output: preprocessed image
Algorithm:
Step 1: Start.
Step 2: Select an input image from the DB.
Step 3: Preprocess the image as follows: apply the median filter to the input image; calculate the threshold value; apply the threshold value to the image; estimate MSE and PSNR.
Step 4: Correct the skewed lines in the image.
Step 5: Repeat Steps 2 to 4 for all images in the DB.
Step 6: Stop.

6. SIMILARITY MEASURES
6.1 Mean Squared Error (MSE)
The mean squared error is calculated by

MSE = (1 / (M*N)) * SUM_{i=1..M} SUM_{j=1..N} [g(i, j) - f(i, j)]^2

where M and N are the numbers of pixels in the horizontal and vertical dimensions of the image, g denotes the noisy image and f denotes the filtered image.

6.2 Peak Signal to Noise Ratio (PSNR)
The peak signal-to-noise ratio is calculated by

PSNR = 10 * log10( 255^2 / MSE )

For image quality measurement, the higher the PSNR value for an image with a particular noise type, the better the quality of the image.

7. EXPERIMENTATION AND RESULTS
The proposed model is experimented with different images in the database. The experiments were carried out in Matlab R2014a. The morphological operations were performed using Matlab's Image Processing Toolbox.
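As an illustration of the median filter of Eq. (1) and the MSE/PSNR measures of Section 6, here is a small dependency-free Python sketch (the paper's experiments used Matlab R2014a; this toy re-implementation and its tiny test image are only for clarity):

```python
import math
from statistics import median

def median_filter(img, size=3):
    """Replace each pixel with the median of its size x size neighbourhood
    (at the borders the window is clipped to the image)."""
    h, w = len(img), len(img[0])
    r = size // 2
    out = [[0] * w for _ in range(h)]
    for m in range(h):
        for n in range(w):
            window = [img[i][j]
                      for i in range(max(0, m - r), min(h, m + r + 1))
                      for j in range(max(0, n - r), min(w, n + r + 1))]
            out[m][n] = median(window)  # Eq. (1)
    return out

def mse(g, f):
    """Mean squared error between a noisy image g and a filtered image f."""
    h, w = len(g), len(g[0])
    return sum((g[i][j] - f[i][j]) ** 2 for i in range(h) for j in range(w)) / (h * w)

def psnr(g, f):
    """Peak signal-to-noise ratio in dB for 8-bit images (peak value 255)."""
    return 10 * math.log10(255 ** 2 / mse(g, f))

noisy = [[10, 10, 10],
         [10, 255, 10],  # isolated impulse-noise pixel
         [10, 10, 10]]
clean = median_filter(noisy)
print(clean[1][1])  # -> 10: the isolated spike is removed, edges untouched
```

This shows the property claimed in the text: the single impulse-noise pixel is eliminated because the median of its neighbourhood ignores the outlying value, whereas a mean filter would smear it over the window.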
The dataset consisted of several samples, out of which randomly selected samples were used for testing.

Fig 5: Skew corrected image

The experimental results of the proposed method and the existing methods, with the computed PSNR values, are presented in the following table.

Table 1: Resultant Table

Image No.   Gaussian   Mean    Wiener   Proposed
1           30.25      29.47   30.77    53.52
2           31.05      30.58   29.63    56.62
3           30.15      29.62   27.95    55.65
4           31.10      30.86   30.99    56.68

Table 1 shows the values obtained from the different preprocessing methods for the selected images from the database. The performance was evaluated using the Mean Square Error (MSE) and Peak Signal to Noise Ratio (PSNR) in order to evaluate the quality of the image. Analysis of the values in the table shows that the proposed method is better, with lower MSE and higher PSNR values.

[Bar chart comparing the PSNR of the Gaussian, Mean and Wiener filters with the proposed method] The figure gives a pictorial representation of the evaluated performance. By analysing the obtained results, the proposed model produced the best results; hence the proposed method is an efficient one.

9. REFERENCES
[1]. N. Dhamayanthi and P. Thangavel, "Handwritten Tamil character recognition using neural network," Proceedings of Tamil Internet 2000, Singapore, July 22-24, 2000, pp. 171-176.
[2]. S. Hewavitharana and H. C. Fernando, "A two stage classification approach to Tamil handwriting recognition," Tamil Internet 2002, California, USA, pp. 118-124.
[3]. Srihari, "Online and offline handwriting recognition: A comprehensive survey," IEEE PAMI, vol. 22, no. 1, Jan. 2000.
[4]. P. Chinnuswamy and S. G. Krishnamoorthy, "Recognition of hand printed Tamil characters," Pattern Recognition, vol. 12, pp. 141-152, 1980.
[5]. N. Otsu, "A threshold selection method from grey level histograms," IEEE Trans. Syst. Man Cybern., vol. 9, no. 1, 1979, pp. 62-66.
[6]. M. J. Baheti and K. K. M., "Comparison of Classifiers for Gujarati Numeral Recognition," International Journal of Machine Intelligence, vol. 3, pp. 160-163, 2011.
[7]. A. A. Desai, "Gujarati handwritten numeral optical character recognition through neural network," Pattern Recognition, vol. 43, 2010.
R. Sridevi
Assistant Professor,
Department of Computer Science,
PSG College of Arts & Science,
Coimbatore, TamilNadu.
srinashok@gmail.com

R. Rahinipriyadharshini
Research Scholar,
Department of Computer Science,
PSG College of Arts & Science,
Coimbatore, TamilNadu.
rahinipriyadharshini@gmail.com
ABSTRACT
In today's world, security is required to transmit confidential information over the network. The security of information and communication becomes more and more important with the rapid development of information technology. Cryptographic algorithms play a vital role in providing data security against malicious attacks. The main goal of cryptography is not only to protect data from being attacked or stolen; it can also be used for user authentication. There are two types of cryptographic algorithms, namely symmetric key cryptographic algorithms and asymmetric key cryptographic algorithms. A symmetric key cryptographic algorithm uses only one key for both encryption and decryption. An asymmetric key algorithm uses two different keys for encryption and decryption of the message: the public key is made publicly available and can be used to encrypt messages, while the private key is kept secret and can be used to decrypt received messages. This paper is about how data are protected while being transferred from one electronic device to another using the RSA algorithm. Nowadays, many electronic devices such as phones, tablets and personal computers are used in the workplace for transferring data. The RSA algorithm provides privacy, integrity and confidentiality. It is mandatory to secure that information to prevent unauthorized access to it by any node in the path.

Keywords
Cryptography, encryption, decryption, asymmetric cryptographic algorithms, RSA algorithm.

…hackers to get private data from the users. As a result, much research is needed to find safer ways to transfer data and information on electronic devices.

2. RELIABLE DATA TRANSMISSION AND USAGE IN ELECTRONIC GADGETS
Understanding where you hold private data internally is essential, but it does not provide a complete view of where your data resides. Understanding data flows is a complicated task, and technology can help to provide transparency into network-based data transfers. However, you must additionally understand third-party data access, business-driven data exchange and end-user data transmission capabilities.

2.1 Third party data access
Data identification should include obtaining an understanding of private data that is accessible by third parties. This includes data that is exchanged with third parties and third parties that have direct access to internal systems. Once an inventory of third parties with access to private data exists, controls should be implemented to safeguard third-party access. Focus areas should include:
- Secure data transmissions
- Controlled access to data
- Monitoring of third-party access to private data
- Third-party due diligence and information security assurance.
- Private data may not be transmitted through public networks without adequate encryption.
- Approved technologies may be used to exchange data with third parties.
- Private data may not be shared with third parties without sufficient contracts in place specifying information security requirements, their obligations to protect company data, their responsibilities for monitoring their own third parties, and the company's right to audit and monitor.
- Access to private data must be logged and monitored where appropriate.
- Private data must be anonymized before being stored in less controlled environments, such as test and development environments.
- Private data must be adequately protected through all stages of the data lifecycle and the systems development lifecycle.

2.3 How to protect the electronic gadgets?

…common words or word groups found in a dictionary. For this reason, user-selected passwords are generally considered to be weaker than randomly generated passwords.

3.4 Viruses and Worms
Terminology such as "malware" (malicious software) is emerging. Instant Messaging (IM) is very popular because it provides flexibility, speed and ease of communication, but it is also very vulnerable to attacks because of its flexibility. Attacks are not limited to personal computers (PCs); they now include cell phones and other processor-based electronics, and they will only increase and become more sophisticated. Protecting the security system database from unwanted electronic intruders requires that no software be introduced into the security network without the security management's approval.

3.5 Backup
Backups are kept for several reasons: a computer crash, documentation, employee/contractor investigation, and file corruption. For these reasons, there must be several levels of backup available. The requirements will vary depending upon the specific application.
4.1 RSA Algorithm
RSA is a commonly adopted public key cryptography algorithm. The first, and still most commonly used, asymmetric algorithm, RSA is named for the three mathematicians who developed it: Rivest, Shamir, and Adleman. RSA today is used in hundreds of software products and can be used for key exchange, digital signatures, or encryption of small blocks of data. RSA uses a variable-size encryption block and a variable-size key. The key pair is derived from a very large number, n, that is the product of two prime numbers chosen according to special rules. Since it was introduced in 1977, RSA has been widely used for establishing secure communication channels and for authenticating the identity of service providers over insecure communication media. In the authentication scheme, the server implements public key authentication with the client by signing a unique message from the client with its private key, thus creating what is called a digital signature. The signature is then returned to the client, which verifies it using the server's known public key. Encryption algorithms consume a significant amount of computing resources such as CPU time, memory, and battery power. This paper examines a method for evaluating the performance of various algorithms. A performance characteristic typically depends on both the encryption key and the input data. A comparative analysis is performed for those encryption algorithms at different sizes of data blocks and, finally, encryption/decryption speed. Features of the RSA algorithm.

A public and private key pair comprises two uniquely related cryptographic keys (basically long random numbers, e.g. 3048 0241 00C9 18FA CF8D EB2D EFD5 FD37 89B9 E069 EA97 FC20 5E35 F577 EE31 C4FB C6E4 4811 7D86 BC8F BAFA 362F 922B F01B 2F40 C744 2654 C0DD 2881 D673 CA2B 4003 C266 E2CD CB02 0301 0001). Because the key pair is mathematically related, whatever is encrypted with a public key may only be decrypted by its corresponding private key, and vice versa. Public-key cryptography finds application in, amongst others, the IT security discipline of information security, where it is used as a method of assuring the confidentiality, authenticity and non-repudiation of electronic communications and data storage. The private key, in turn, is used to decrypt cipher text or to create a digital signature.

4.4 Encryption Algorithm
Encryption is the process of scrambling plaintext into cipher text: the conversion of electronic data into another form, called cipher text, which cannot be easily understood by anyone except authorized parties. The primary purpose of encryption is to protect the confidentiality of digital data stored on computer systems or transmitted via the Internet or other computer networks. Modern encryption algorithms play a vital role in the security assurance of IT systems and communications, as they can provide not only confidentiality but also other key elements of security.

4.5 Decryption Algorithm
Decryption is the process of converting a cipher text message back into a plaintext message using a key. The RSA cryptosystem is the most attractive and popular security technique for many applications, such as electronic commerce and secure Internet access. In this case, the decryption operation incurs a higher computational cost, since it performs modular exponentiation. The algorithm for the encryption and decryption process is represented below.
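As a complement to the description above, a minimal textbook-RSA sketch is given below. This is illustrative only: the toy primes, the helper names and the absence of padding are our own simplifications (real deployments use much larger primes and padded RSA such as OAEP).

```python
# Minimal textbook RSA: key generation, encryption c = m^e mod n,
# decryption m = c^d mod n. Toy primes only; no padding (illustrative).

def egcd(a, b):
    # extended Euclid: returns (g, x, y) with a*x + b*y = g
    if b == 0:
        return a, 1, 0
    g, x, y = egcd(b, a % b)
    return g, y, x - (a // b) * y

def make_keypair(p, q, e=17):
    n = p * q                      # public modulus
    phi = (p - 1) * (q - 1)        # Euler's totient of n
    g, d, _ = egcd(e, phi)         # d = e^-1 mod phi(n)
    assert g == 1, "e must be coprime to phi(n)"
    return (e, n), (d % phi, n)    # (public key, private key)

def encrypt(m, pub):
    e, n = pub
    return pow(m, e, n)            # c = m^e mod n

def decrypt(c, priv):
    d, n = priv
    return pow(c, d, n)            # m = c^d mod n

pub, priv = make_keypair(61, 53)   # toy primes, n = 3233
assert decrypt(encrypt(42, pub), priv) == 42
```

Note how decryption is the same modular exponentiation as encryption, only with the private exponent d, which is why it dominates the computational cost mentioned above.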
implementing in the electronic gadgets for protecting the private data.

6. CONCLUSION
Technological innovations have put an end to traditional working processes. Today's business world depends mainly on electronic gadgets for its working processes, such as online business, e-commerce, and file transfer over the network. These new technologies are expected to pose some serious threats to information security. Keeping this in mind, the necessary planning activities should be carried out to safeguard information at all levels within a corporation or organization. It is also important to standardize the working process; to determine which procedures are in compliance with information security standards; and to set safety policies for the usage of electronic devices and wireless networks. The proposed algorithm reduces the effectiveness of intrusion attacks, as only a part of the message will be available even if an intruder intercepts a message and decrypts it; moreover, decrypting part of a message is not easy. It uses the RSA algorithm, one of the most effective and commonly used cryptographic algorithms, and adds more steps to it to reduce attacks. A crucial part of information security is the privacy issue; the standards for this issue should also be defined within an organization.

7. REFERENCES
[1] Zhiyong Peng and Xiaojuan Li, "The improvement and simulation of LEACH protocol for WSNs," Software Engineering and Service Sciences (ICSESS), 2010 IEEE International Conference on, pp. 500-503, 16-18 July 2011.
[2] Bahadori, M., Mali, M. R., Sarbishei, O., Atarodi, M., and Sharifkhani, M., "A novel approach for secure and fast generation of RSA public and private keys on Smart Card," NEWCAS Conference (NEWCAS), 2010 8th IEEE International, pp. 265-268, 20-23 June 2010.
[3] Massey, J. L., "An Introduction to Contemporary Cryptology," Proceedings of the IEEE, Special Section on Cryptography, 533-549, May 2011.
[4] Manikandan, G., Rajendiran, P., Chakarapani, K., Krishnan, G., and Sundarganesh, G., "A Modified Crypto Scheme for Enhancing Data Security," Journal of Theoretical and Advanced Information Technology, Jan 2012.
[5] Wang, X., "Cryptanalysis for Hash Functions and Some Potential Dangers," RSA Conference 2010, Cryptographers' Track, 2010.
[6] Sandeep Sharma and Rishabh Arora, "Performance Analysis of Cryptography Algorithms," International Journal of Computer Applications (0975-8887), Volume 48, No. 21, June 2012.
[7] Lazou, A., and Weir, G. R., "Perceived Risk and Sensitive Data on Mobile Devices," University of Strathclyde Publishing, UK, pp. 183-196, 2011.
[8] William Stallings, Cryptography and Network Security, ISBN 81-7758-011-6, Pearson Education, Third Edition.
[9] Atul Kahate, Computer and Network Security, Third Edition, Tata McGraw Hill Publication Company Limited, 2013.
[10] Ambedkar, B. R., Gupta, A., Gautam, P., and Bedi, S. S. (2011, June). An Efficient Method to Factorize the RSA Public Key Encryption. In Communication Systems and Network Technologies (CSNT), 2011 International Conference on (pp. 108-111). IEEE.
[11] Neetu Settia. Cryptanalysis of Modern Cryptography Algorithms. International Journal of Computer Science and Technology. December 2010.
[12] Nentawe Y. Goshwe (2013). Data Encryption and Decryption Using RSA Algorithm in a Network Environment. IJCSNS International Journal of Computer Science and Network Security, Vol. 13, No. 7.
Dr. S. K. Jayanthi, Head and Associate Professor, Department of Computer Science, Vellalar College for Women, Erode, India, jayanthiskp@gmail.com
V. Subhashini, Research Scholar, Department of Computer Science, Vellalar College for Women, Erode, India, vsubhamohan@gmail.com
ABSTRACT
Search engines are an excellent medium for sharing information. Web spam reduces the quality of search results and increases the cost of each processed query due to the storage and retrieval of useless pages. Spammers encourage viewers to visit their sites by providing undeserved advertisements and malicious content in their pages, and try to install malware on the visitor's machine. Recently, web spam has increased rapidly, leading to a degradation of search results. This paper presents a review of web spam techniques that compromise search engines.
Keywords
Web spam, Content spam, link spam, Content hiding,
Cloaking, Redirection, PageRank, HITS.
1. INTRODUCTION
The World Wide Web provides a vast amount of data, and more and more people rely on the wealth of information available in it. The increasing importance of search engines has given rise to web spam, which exists to mislead search engines so that certain web pages rank high in search results and thus capture user attention. Spammers have different goals for uploading a spam page: the first is to attract viewers to their sites and enhance the score of the page in order to increase financial benefits for the site owners; the second is to encourage people to visit their sites in order to introduce their company and its products, and to persuade visitors to buy them; the third is to install malware on victims' computers. Web spam can be classified as content spam (adding irrelevant words to a document to rank it high) and link spam (spam on hyperlinks) [1, 2].

2. SPAMMING TECHNIQUES
Content spamming and link spamming are two spamming techniques that influence the ranking algorithms used by search engines [5].

2.1 Content spamming
Content spamming manipulates the content of web pages by stuffing them with keywords repeated several times [5]. A large amount of machine-generated spam pages exists, such as shown in Figures 1 and 2.

Figure 1: Top hit list of a major search engine for the query "download free mp3 digital camera Microsoft Linux".

Search engines consider the occurrence of the query in a web page. Each type of location is called a field. The text fields for a page p are the document body, the title, the Meta tags in the HTML header, and page p's URL. The anchor texts associated with URLs that point to p are also considered as belonging to page p (the anchor text field), since they describe the content of p. The terms in p's text fields are used to determine the relevance of p with respect to a specific query (a group of query terms), with different weights given to different fields.

The ranking algorithms use various forms of the fundamental TFIDF metric to rank web pages [1]. Given a specific text field, for each term t that is common to the text field and a query, TF(t) is the frequency of that term in the text field. For instance, if the term "spam" appears 6 times in a document body made up of a total of 20 terms, TF(spam) is 6/20 = 0.3.
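The per-field term-frequency computation just described can be sketched as follows; the simple whitespace tokenizer and the toy document are our own illustration, not part of the paper.

```python
# Sketch of the per-field TF computation described above:
# TF(t) = occurrences of t in the field / total terms in the field.

def tf(term, field_text):
    terms = field_text.lower().split()
    if not terms:
        return 0.0
    return terms.count(term.lower()) / len(terms)

# Reproduce the worked example: 6 occurrences of "spam" in a 20-term body.
body = " ".join(["spam"] * 6 + ["filler"] * 14)
assert tf("spam", body) == 6 / 20   # 0.3, matching the text
```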
Based on TFIDF scores, spammers can have two goals: either to make a page relevant for a large number of queries (i.e., to receive a non-zero TFIDF score), or to make a page very relevant for a specific query (i.e., to receive a high TFIDF score). The first goal can be achieved by including a large number of distinct terms in a document; the second by repeating some targeted terms.

Figure 3: A page having unrelated hyperlinks
Spammers create various pages that link to the spammers' target pages, as shown in Figure 5.

Inaccessible pages: pages the spammer cannot modify. These pages are out of reach; the spammer cannot influence their outgoing links.
HIDING TECHNIQUES
ABSTRACT
In the present scenario, Educational Data Mining (EDM) is one of the emerging disciplines, involving the process of analyzing students' information using different attributes, such as Medium of study, First Graduate, Residence, Location, Family, Father's Qualification, Mother's Qualification, Annual Income, HSC Marks, and PSM. Based on the PSM and other attributes, the forthcoming semester performance is predicted. Feature selection techniques are applied to these attributes: during feature selection, irrelevant attributes are removed and relevant attributes are selected for further reference. Three feature selection methods are used and the best one is identified.

Keywords
Educational Data Mining (EDM), Feature Selection techniques, PSM (Previous Semester Marks), Information Gain Attribute Evaluation (IGATE), Gain Ratio Based Feature Selection (GRFS).

1. INTRODUCTION
Data mining techniques are used to extract meaningful information from a large data source using patterns and methods. They have been used in many applications such as EDM, web mining, text mining, etc. EDM is an emerging discipline concerned with developing methods for exploring the unique type of data that originates from educational settings, and using such methods to better understand students [1]. Its methods differ from standard data mining methods. There are three main goals in EDM [2]:

Pedagogical: to help in the design of didactic contents and the improvement of the academic performance of the students.

Managerial: to optimize the organization and maintenance of education infrastructures, areas of interest and research studies.

Commercial: to help in student recruitment in private education.

In this paper, the feature selection techniques applied in the EDM process are described. Data mining applications and techniques can be used to develop the education sector.

2. LITERATURE REVIEW
Acharya et al. [3] presented a method on educational data mining. Feature selection techniques were applied and only the relevant attributes were selected for further use. Initially 15 features were used: Gender, Caste, Religion, Fsize, Board, Sorigin, Income, Boardmarks, Hday, Atten, Midsem, Medium, School, Ptution, and Grade. Three feature selection techniques were applied: Correlation Based Feature Selection (CBFS), Chi-Square Based Feature Evaluation, and Information Gain Attribute Evaluation (IGATE). After applying feature selection, 8 attributes were selected for further use. They compared those feature selection techniques and visualized the results.

Ramaswamy et al. [4] developed a technique on educational data mining. The data were collected from different schools in different districts of the state of Tamil Nadu. Initially the data set had 32 features. They used six feature selection techniques on their data set to select the best attributes: CBFS, CHFS, GRFS, IGFS, and RF. After applying feature selection, only 9 features were selected. They compared all the feature selection methods, chose the best one, and applied that method.

Hany et al. [5] developed a method to select the optimal features for the student model. Data were collected from a suburban middle school in central Massachusetts. Initially the data set had 15 features. They used the Weka tool for selecting the best attributes; Weka, an open-source machine learning software, was used to conduct an extensive comparison of six classifier algorithms.

A classification technique was described by Baradwaj et al. [6]. The main aim is to improve the students' performance. The Decision Tree classification technique was used; the ID3 algorithm was used to predict and improve the students' performance. They
collected the data from the VBS Purvanchal University, Jaunpur, Computer Applications department, for the MCA course 2007-2010. Attendance, class test, seminar and assignment marks were collected from the student database to predict performance at the end of the semester, which helps teachers to improve the students' performance.

3. DATA SET DESCRIPTION
3.1 Data Collection
The data have been collected from Ayya Nadar Janaki Ammal College, Sivakasi, Tamil Nadu, India, sampling the Computer Applications department (Master of Computer Applications). In this method, all individual tables are combined into a single table containing all the required data.

3.2 Data Preparation
Initially the size of the data set is 47. We conducted a survey to collect the students' mark details, personal information and family background. The attributes are Name, Med, FG, Resi, LivLoc, Fsize, Fedu, Medu, Finc, HSC, and PSM. Three tables (mark details, family details, personal details) were merged for this process. The data set values for some of the variables were defined for the present analysis.

ATT: Student's attendance. It is split into 3 classes: Good - above 95, Average - 80-94, Poor - below 80.

HSCM: Higher Secondary marks. Based on their marks, students are categorized into 3 groups: Good - above 60, Average - 40-59, Poor - below 40.

PSM: Previous Semester Marks/Grade obtained in the MCA course. It is split into class values: Grade A - above 80, Grade B - 65-79, Grade C - 51-64, Grade U - below 50.

All variables are categorical values. PSM is the dependent variable; all other variables are predictor variables.

Table 1. Attribute Description
Attributes | Description | Possible Values
MED | Medium of study | Tamil, English
FG | First Graduate | Yes, No
RESI | Residence | Day Scholar, Hostel

4. METHODOLOGY
4.1 Feature Selection
Feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features for use in model construction [7]. Feature selection has proven, in both theory and practice, to be effective in enhancing learning efficiency, increasing predictive accuracy and reducing the complexity of learned results [8]. Feature selection in supervised learning has the main goal of finding a feature subset that produces higher classification accuracy [9].

Three filter feature subset methods were used.

4.1.1 Chi-Squared Feature Selection
One of the popular feature selection methods is chi-square (X^2). It is a statistical test that can be used to determine whether observed frequencies are significantly different from expected frequencies. It is used to select the predictor variables [10].

Entropy(D) = - Σ_{i=1..m} p_i log2(p_i)    (2)

InformationGain_A(D) = Entropy(D) - Σ_{j=1..v} (|D_j| / |D|) Entropy(D_j)    (3)
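The entropy and information-gain formulas above can be computed as in the sketch below; the toy records (a MED attribute against a PSM class label) are our own illustration.

```python
import math
from collections import Counter

# Sketch of the entropy and information-gain computations above:
# Entropy(D) = -sum_i p_i log2(p_i)
# Gain_A(D)  = Entropy(D) - sum_j (|D_j|/|D|) Entropy(D_j)

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, label):
    base = entropy([r[label] for r in rows])
    partitions = {}
    for r in rows:                       # partition D by the attribute's value
        partitions.setdefault(r[attr], []).append(r[label])
    weighted = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return base - weighted

# Toy data: does Medium of study tell us anything about the PSM grade?
data = [{"MED": "Tamil", "PSM": "A"}, {"MED": "Tamil", "PSM": "B"},
        {"MED": "English", "PSM": "A"}, {"MED": "English", "PSM": "A"}]
gain = information_gain(data, "MED", "PSM")
```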
It is used to improve the performance of the classification algorithm and to find the best feature selection method [12].

Table 2. Methods and selected Attributes
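The chi-squared score from Section 4.1.1 (comparing observed attribute/class frequencies against those expected under independence) can be sketched as follows; the contingency counts are our own illustration.

```python
# Sketch of the chi-squared feature score from Section 4.1.1:
# X^2 = sum over cells of (observed - expected)^2 / expected,
# where expected counts assume attribute and class are independent.

def chi_squared(table):
    # table[i][j] = observed count for attribute value i and class j
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    score = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            score += (observed - expected) ** 2 / expected
    return score

# e.g. MED (Tamil/English) against pass/fail counts (hypothetical data)
observed = [[20, 10],   # Tamil:   20 pass, 10 fail
            [15, 15]]   # English: 15 pass, 15 fail
score = chi_squared(observed)   # higher score = stronger association
```

A larger score suggests the attribute is more informative about the class, which is why attributes are ranked by it during filter-based selection.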
Compared to the Chi-Squared and Information Gain methods, Gain Ratio is the best method; it selects the best attributes. We have concentrated on some experiments in order to evaluate the performance and usefulness of the different feature selection methods for selecting the best attributes [14]. The results of the comparative study of the three feature selection methods are shown in Table 2.

6. CONCLUSION
In this paper, we carried out an analysis of three feature selection methods. According to the subset selection method, the best attributes have been selected. Based on the time complexity, we conclude that Gain Ratio is the best method for feature selection, as it takes less time compared to the chi-squared and information gain methods. In future, this approach can be extended. The method can be applied to different types of data sets that require dimensionality reduction and classification.

7. REFERENCES
[1] Baker, R. S., and Yacef, K. "The state of educational data mining in 2009: A review and future visions." JEDM - Journal of Educational Data Mining 1(1) (2009): 3-17.
[2] Barracosa, J. I. M. S. 2011. Mining Behaviors from Educational Data.
[3] Anal, A., and Devadatta, S. 2014. Early Prediction of Students using Machine Learning Techniques. IJCA International Journal of Computer Applications, 107(1).
[4] Ramaswami, M., and Bhaskaran, R. 2009. A study on feature selection techniques in educational data mining. arXiv preprint arXiv:0912.3924.
[5] Harb, H. M., and Moustafa, M. A. 2012. Selecting Optimal Subset of Features for Student Performance Model. International Journal of Computer Science Issues (IJCSI), 9(5).
[6] Baradwaj, B. K., and Pal, S. 2012. Mining educational data to analyze students' performance. arXiv preprint arXiv:1201.3417.
[7] http://en.wikipedia.org/wiki/Feature_selection
[8] Miller, A. 2012. Subset Selection in Regression. CRC Press.
[9] Koller, D., and Sahami, M. 1996. Toward optimal feature selection.
[10] http://nlp.stanford.edu/IRbook/html/htmledition/chi-square-feature-selection-1.html
[11] Azhagusundari, B., and Thanamani, A. S. Feature selection based on information gain. International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN 2278-3075.
[12] Priyadarsini, R. P., Valarmathi, M. L., and Sivakumari, S. Gain ratio based feature selection method for privacy preservation.
[13] http://en.wikipedia.org/wiki/R_(programming_language)
[14] Pal, A. K., and Pal, S. 2013. Analysis and Mining of Educational Data for Predicting the Performance of Students. IJECCE 4(5).
especially for protocols that use periodic updates of routing tables. They proposed using Dynamic Source Routing (DSR), which is based on on-demand route discovery; a number of protocol optimizations are also proposed to reduce the route discovery overhead. The AODV (Ad hoc On-demand Distance Vector routing) protocol also uses a demand-driven route establishment procedure. TORA (Temporally-Ordered Routing Algorithm) [6] is designed to minimize reaction to topological changes by localizing routing-related messages to a small set of nodes near the change. Haas and Pearlman [7] attempt to combine the proactive and reactive approaches in the Zone Routing Protocol (ZRP), by initiating the route discovery phase on demand but limiting the scope of the proactive procedure to the initiator's local neighborhood. Recent papers present comparative performance evaluations of several routing protocols [7].

The previous MANET routing algorithms do not take into account the physical location of the destination node. In this paper, we propose two algorithms to reduce route discovery overhead using location information. Similar ideas have been applied to develop selective paging for cellular PCS (Personal Communication Service) networks [8]: in selective paging, the system pages a selected subset of cells close to the last reported location of a mobile host, which decreases the location tracking cost. We propose and evaluate an analogous approach for routing in MANET. A survey of potential applications of GPS suggests using location information in ad hoc networks, though it does not elaborate on how the information may be used. Other researchers have also suggested that location information should be used to improve (qualitatively or quantitatively) the performance of a mobile computing system, for instance a packet radio system using location information for routing. There, the network infrastructure consists of fixed base stations whose precise location is determined using a GPS receiver at the time of installation, and a geographically based routing scheme delivers packets between base stations: a packet is forwarded one hop closer to its final destination by comparing the location of the packet's destination with the location of the node currently holding the packet. A routing and addressing method that integrates the concept of physical location (geographic coordinates) into the current design of the Internet has been investigated in [8]. Recently, another routing protocol using location information has been proposed [3]. This protocol, named DREAM, maintains location information of each node in routing tables and sends data messages in a direction computed from these routing (location) tables. To maintain the location table accurately, each node periodically broadcasts a control packet containing its own coordinates, with the frequency of dissemination computed as a function of the node's mobility and the distance separating two nodes (called the distance effect). Unlike [3], we suggest using location information for route discovery, not for data delivery.

3. LOCATION-AIDED ROUTING (LAR) PROTOCOLS
3.1 Route discovery using flooding
In this paper, we explore the possibility of using location information to improve the performance of routing protocols for MANET. As an illustration, we show how a route discovery protocol based on flooding can be improved. The route discovery algorithm using flooding is described next (this algorithm is similar to DSR [8] and AODV [9]). When a node S needs to find a route to node D, node S broadcasts a route request message to all its neighbors; hereafter, node S will be referred to as the sender and node D as the destination. A node, say X, on receiving a route request message, compares the desired destination with its own identifier. If there is a match, the request is for a route to node X itself; otherwise, node X broadcasts the request to its neighbors. To avoid redundant transmissions of route requests, a node X only broadcasts a particular route request once (repeated reception of a route request is detected using sequence numbers). Figure 1 illustrates this algorithm: node S needs to determine a route to node D, so node S broadcasts a route request to its neighbors. When nodes B.

Fig 1: Illustration of flooding

3.2 Preliminaries
Location information
The proposed approach is termed Location-Aided Routing (LAR), as it makes use of location information to reduce routing overhead. Location information used in the LAR protocol may be provided by the Global Positioning System (GPS) [10]. With the availability of GPS, it is possible for a mobile host to know its physical location. In reality, position information provided by GPS includes some amount of error, which is the difference between the GPS-calculated coordinates and the real coordinates. For instance, the NAVSTAR Global Positioning System has a positional accuracy of about 50-100 m, and Differential GPS offers accuracies of a few meters [9].

The proposed algorithm's metrics are based on two essential parameters, bandwidth and load; two additional parameters, hop count and delay, are optional. In this way, optimal usage of network resources, as well as the possibility of guaranteed QoS, should be achieved. The logic of the proposed routing protocol is presented by a block scheme. The main idea is to divide the routing protocol into two phases:
(a) as fast as possible association of a new user;
(b) finding the optimal route for a new user, considering its request.
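The flooding-based route discovery described in Section 3.1 (each node rebroadcasts an unseen request once, with duplicates suppressed) can be sketched as a breadth-first propagation; the topology and node names below are our own illustration.

```python
from collections import deque

# Sketch of flooding route discovery as described above: S broadcasts a
# route request; each node forwards an unseen request once; the request
# records the path it has travelled, which the destination can reply along.

def discover_route(adjacency, src, dst):
    seen = {src}                 # duplicate suppression (sequence numbers)
    queue = deque([[src]])       # each entry is the path a request copy took
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == dst:
            return path          # destination found: reply along this path
        for neigh in adjacency[node]:
            if neigh not in seen:
                seen.add(neigh)  # this neighbor will rebroadcast only once
                queue.append(path + [neigh])
    return None                  # flood exhausted without reaching dst

net = {"S": ["A", "B"], "A": ["S", "D"], "B": ["S"], "D": ["A"]}
assert discover_route(net, "S", "D") == ["S", "A", "D"]
```

LAR's contribution, described next in the paper, is to restrict which neighbors forward the request based on location, shrinking the flooded region.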
Parameters | Value
Simulator | GlomoSim
Protocols Studied | AODV, DSR, LAR
Simulation Time | 50s, 100s, 150s, 200s, 250s
Terrain Dimension | 2000*2000
No. of Nodes | 25, 50, 100, 150, 200
Node Placement | Random
Node Mobility Model | RWP
Bandwidth | 2 Mb/s
Traffic Type | CBR

4.2.1 Average Hop Count
Average hop count (measured as the number of hops) is the average number of links used in a route. This metric indicates how good the routing protocol is with respect to network resources. Routes with minimal hop count need not be optimal from the point of view of all the observed network parameters; however, as fewer links are involved in the final route, the selection is better. In our case, the proposed NN algorithm has the smallest average hop count (Figure 20), due to the fact that the NN can find the Pareto-optimal route, so the network resources are used optimally.

Fig 6: The performance of the routing protocols

After analyzing all the presented results, we can conclude that the proposed routing algorithm based on the artificial neural network could be efficiently used in wireless mesh networks. In more than 96% of the cases in intensive simulations under different and randomly changed network and traffic conditions, the proposed protocol was better than AODV and OSPF. It must be stressed that the NN simulation was performed by a digital computer, i.e., working in sequential mode, which is unnatural for neural networks. It is well known that neural networks exhibit significantly better results in hardware realization, when the natural parallel mode of operation is possible, producing dramatically shorter execution times; but in our research we concentrated on the architecture and organization of a new routing protocol which includes artificial intelligence in the routing process.

4.2.3 Packet Delivery Ratio
The packet delivery ratio (PDR) is defined as the ratio between the packets that are received and the number of packets sent. This is one of the most used metrics for protocol
comparison. In our simulations the proposed NN model exhibits the best performance, followed by OSPF, as depicted in Fig 7.

Fig 7: The performance of the routing protocols - Packet delivery ratio

4.2.4. Throughput
The throughput between two nodes is expressed as the number of bytes delivered per unit of time. Formally:

Throughput = Total bytes received / Total time

We calculated the throughput (measured in bytes per second) as a function of the traffic load (expressed as the number of equal-sized packets per second) for the different routing protocols; the results are depicted in Figure 18. The proposed NN algorithm gives significantly better results than AODV and OSPF. This is based on the fact that the proposed algorithm is tailored to finding the optimal routes, thus increasing the number of packets which reach the destination.

Fig 8: The performance of the routing protocols - Throughput

5. CONCLUSION
This paper describes how location information may be used to reduce the routing overhead in ad hoc networks. We present two location-aided routing (LAR) protocols. These protocols limit the search for a route to the so-called request zone, determined based on the expected location of the destination node at the time of route discovery. Simulation results indicate that using location information results in significantly lower routing overhead, as compared to an algorithm that does not use location information. We also suggest several optimizations of the basic LAR schemes which may improve performance. Further work is required to evaluate the efficacy of these optimizations, and also to develop other ways of using location information in ad hoc networks, for instance to improve the performance of reactive algorithms such as TORA [27,28], or to implement location-based multicasting. The many implementations of wireless mesh networks, and their significant advantages over competing technologies, demand continuous improvement in QoS. The main contributor to the desired quality and network stability is the routing protocol.

The proposed routing protocol is compared with two well-known routing protocols. We used several metrics for describing the performance of the routing algorithms. It is shown that the proposed routing protocol has better or the same performance in all metrics except one, even though the neural network is simulated on a digital computer. Our protocol is scalable and is adapted for dynamic network topologies and real network environments. We believe that the performance of the proposed routing protocol can be further improved if we use multi-radio or multi-channel routing; we will explore this option in our future work. Also, a hardware realization of the neural network will significantly improve the execution time of the optimization and decision processes.

6. REFERENCES
[1] Akyildiz, S.M. Joseph and Yi-Bing Lin, Movement-based location update and selective paging for PCS networks, IEEE/ACM Transactions on Networking 4 (1996) 94-104.
[2] C. Alaettinoglu, K. Dussa-Zieger, I. Matta and A.U. Shankar, MaRS user's manual - version 1.0, Technical Report TR 91-80, The University of Maryland (June 1991).
[3] S. Basagni, I. Chlamtac, V.R. Syrotiuk and B.A. Woodward, A distance routing effect algorithm for mobility (DREAM), in: Proc. of ACM/IEEE MOBICOM '98 (1998).
[4] J. Broch, D.A. Maltz, D.B. Johnson, Y.-C. Hu and J. Jetcheva, A performance comparison of multi-hop wireless ad hoc network routing protocols, in: Proc. of MOBICOM '98 (1998).
[5] S. Corson, S. Batsell and J. Macker, Architectural considerations for mobile mesh networking (Internet draft RFC, version 2), in: Mobile Ad-hoc Network (MANET) Working Group, IETF (1996).
[6] S. Corson and A. Ephremides, A distributed routing algorithm for mobile wireless networks, Wireless Networks (1995) 61-81.
[7] S. Corson and J. Macker, Mobile ad hoc networking (MANET): Routing protocol performance issues and evaluation considerations (Internet draft), in: Mobile Ad-hoc Network (MANET) Working Group, IETF (1998).
[8] S.R. Das, R. Castaneda, J. Yan and R. Sengupta, Comparative performance evaluation of routing protocols for mobile, ad hoc networks, in: Proc. of IEEE IC3N '98 (1998).
[9] Luo Junhai, Ye Danxia, et al., Research on topology discovery for IPv6 networks, IEEE SNPD 2007, 3 (2007) 804-809.
[10] S. Toumpis, Wireless ad-hoc networks, in: Vienna Sarnoff Symposium, Telecommunications Research Center, April 2004.
[11] O. Tariq, F. Greg, W. Murray, On the effect of traffic model to the performance evaluation of multicast protocols in MANET, Canadian Conference on Electrical and Computer Engineering (2005) 404-407.
[12] X. Chen, J. Wu, Multicasting Techniques in Mobile Ad-hoc Networks, Computer Science Department, Southwest Texas State University, San Marcos.
[13] S.K. Sarkar, T.G. Basavaraju, C. Puttamadappa, Ad Hoc Mobile Wireless Networks: Principles, Protocols and Applications, Auerbach Publications, New York, NY, USA, 2007.
[14] R.S. Hamid, R. Ulman, A. Swami, A. Ephremides, Wireless Mobile Ad Hoc Networks, Hindawi Publishing Corporation, New York, NY, USA, 2007.
[15] Somya Jain, Deepak Aggarwal, Performance Evaluation of Routing Protocols for MAC Layer Models, IOSR Journal of Computer Engineering (IOSR-JCE), e-ISSN:
Fig 1: Reference Architecture

Apart from the Reference Architecture for mobile databases, there are three types of architectures discussed in the literature [9], namely Client/Server architecture, Server/Agent/Client architecture and Peer-to-Peer architecture. Client/Server architecture consists of a server, fixed hosts and mobile clients. In Server/Agent/Client architecture, an agent is included either on the server or on the client side. An agent is a computer program that can achieve a series of goals of designers and users, and can operate freely and autonomously in the mobile environment. In Peer-to-Peer architecture, clients may also communicate with other clients to share their data.

3.2. Advantages
- Offline access to data: the user can read and update the data without a network connection
- Overcomes problems such as dropped connections, low bandwidth and high latency, which are common in wireless networks
- Increases the battery life of the device by not using the modem all the time
- Reduces application cost by reducing the length of network connection time
- Reduces wireless airtime fees by synchronizing only the updated data

3.3. Challenges

4.1. Confidentiality
In many corporate sectors, businesses are conducted not only within the confines of their physical boundary but also outside their buildings. They have employees such as mobile workers, who can travel and work in different geographical areas to conduct transactions. In order to conduct transactions, they need corporate sensitive data with them, so organizations permit the mobile workers to carry business data on their mobile devices so that productivity and revenue can be increased. While hoarding the data on their devices, the data should be properly protected; otherwise it will be exposed to outsiders, including competitors. Therefore, the confidentiality of the database that resides on the mobile device is a major concern for such organizations.

The various encryption techniques used for securing mobile databases are well explained in paper [10]. The necessity of encryption is strongly suggested in [11], to diminish the risk of intentional or accidental disclosure of sensitive data on portable devices. Symmetric (or private-key) encryption and asymmetric (or public-key) encryption are the two broader classes of cryptographic algorithms. Compared to private-key encryption, public-key encryption is highly computationally intensive; therefore, for resource-constrained mobile devices, symmetric-key algorithms are more suitable [12]. A new technique, C-SDA (Chip-Secured Data Access), is recommended in paper [13]. It is a client-based security component acting as an incorruptible mediator between a mobile client and an encrypted database. This component is embedded into a smart card to prevent any alteration from occurring on the client side. Lightweight encryption algorithms such as Random Number Addressing Cryptography [14] and the Scalable Encryption Algorithm [15] are also found in the literature for lightweight devices. These two algorithms are co-designed using both hardware and software.
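The symmetric-key approach favoured above for resource-constrained devices can be illustrated with a minimal sketch. This is a toy CTR-style stream construction built from SHA-256; the names (`keystream`, `encrypt`, `decrypt`) and the design are ours, purely for illustration. It is not a vetted cipher: a production mobile database should use an established algorithm such as AES.

```python
import hashlib
import os

def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Derive a keystream by hashing key || nonce || counter (CTR-style)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out.extend(hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(out[:length])

def encrypt(key: bytes, plaintext: bytes) -> bytes:
    """Symmetric encryption: XOR plaintext with the keystream.
    The random nonce is prepended so decryption can regenerate the stream."""
    nonce = os.urandom(16)
    stream = keystream(key, nonce, len(plaintext))
    return nonce + bytes(p ^ s for p, s in zip(plaintext, stream))

def decrypt(key: bytes, ciphertext: bytes) -> bytes:
    nonce, body = ciphertext[:16], ciphertext[16:]
    stream = keystream(key, nonce, len(body))
    return bytes(c ^ s for c, s in zip(body, stream))

key = os.urandom(32)
record = b"account=4711;balance=1520.00"   # hypothetical mobile DB record
sealed = encrypt(key, record)
restored = decrypt(key, sealed)
```

Note that the same key is used for encryption and decryption, which is exactly the property that makes symmetric schemes cheap enough for mobile hardware.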
a person found to be involved in such activities should be penalized.

Implementing strong security policies, installing monitoring software and educating the employees are the possible solutions to overcome these issues.

Table 1. Problems and Appropriate Solutions for Mobile Database Security

1. Unauthorized disclosure of the data
   Solutions: 1. Lightweight encryption algorithm; 2. Strong encryption algorithm
2. Unauthorized access of the data
   Solutions: Password / PIN / Token / Biometric factors such as fingerprint, iris recognition, voice recognition, etc.
3. Unauthorized modification of the data
   Solutions: Grant / Revoke commands
4. Destruction of sensitive data due to device corruption
   Solutions: Backup / recovery procedures
5. Unauthorized disclosure of data due to device loss/theft
   Solutions: 1. PIN / password / passcode / pass pattern; 2. Remote wiping
6. Interception of data through WiFi, Bluetooth, etc.
   Solutions: Turn off Bluetooth and WiFi facilities
7. Unauthorized access and modification of the data during transmission
   Solutions: 1. Encryption; 2. Server authentication and client authentication; 3. WPKI certificate
8. Problems due to malicious software such as viruses or Trojan horses
   Solutions: 1. Anti-virus software; 2. Firewall
9. Problems due to employees / insiders (BYOD)
   Solutions: Establish mobile device policies and enforce penalties for workers who disobey the policies

5. SUGGESTIONS FOR ORGANIZATIONS
1. Ensure that data stored in mobile devices are encrypted using a lightweight encryption algorithm.
2. Ensure that all the data stored on removable drives, such as USB flash drives and memory cards, are also encrypted properly, so that they will not be revealed to an unknown person in case of loss.
3. Ensure that a proper authentication mechanism is incorporated to determine and verify the user's identity.
4. Ensure that appropriate access control mechanisms are established to protect data integrity by restricting who can alter data.
5. Ensure that periodic backups of mobile devices are taken, so that a device can quickly be put into use again whenever any destruction occurs.
6. Ensure that the mobile device has a lockout facility, so that when an unauthorized person tries to enter the PIN/password/passcode unsuccessfully more than 5 times, the device is automatically locked out.
7. Whenever the device is lost or stolen, immediately inform the organization, so that data stored inside the device can be deleted through remote wiping.
8. Ensure that Bluetooth- and WiFi-enabled mobile devices are turned off when they are not used.
9. Ensure that the network path between the mobile client and the server is secured.
10. Ensure that anti-virus software and firewalls are installed on the mobile devices and updated regularly.
11. Ensure that mobile device policies are established in the organization and that users are informed about the importance of the policies and the means of protecting their information.
12. Ensure that the mobile device uses a pattern to lock out the device.

6. CONCLUSION
A study by Gartner found that 45 percent of workers in the United States work outside the traditional office at least eight hours per week. By 2014, 90 percent of organizations will support corporate applications on personal devices [18]. Therefore, there is a need to analyze the problems and identify the appropriate solutions for securing mobile databases. It is also important to continue to explore the problems faced by database systems in the mobile environment.

7. REFERENCES
[1] Ouri Wolfson, Mobile Database, Encyclopedia of Database Systems, Part 13, pp. 1751, 2009. http://www.springerlink.com/content/n72wu51n4056524g/fulltext.html
[2] D. Roselin Selvarani and Dr. T. N. Ravi, Issues, Solutions and Recommendations for Mobile Device Security, International Journal of Innovative Research in Technology and Science (IJIRTS), Vol. 1, No. 5, pp. 9-14, November 2013, ISSN: 2321-1156. (online) http://ijirts.org/volume1issue5/IJIRTSV1I5027.pdf
[3] Parviz Ghorbanzadeh, Aytak Shaddeli, Roghieh Malekzadeh, Zoleikha Jahanbakhsh, A Survey of Mobile Database Security Threats and Solutions for It, 3rd International Conference on Information Sciences and Interaction Sciences (ICIS), pp. 676-682, 2010.
[4] Tao Zhang, Shi Xing-jun, Status Quo and Prospect on Mobile Database Security, Telkomnika, Vol. 11, No. 9, pp. 4949-4955, September 2013.
[5] A.R. Bhagat, V.B. Bhagat, Mobile Database Review and Security Aspects, International Journal of Computer Science and Mobile Computing, Vol. 3, Issue 3, pp. 1174-1182, March 2014.
[6] Sanket Dash, Malaya Jena, In the Annals of Mobile Database Security, CPMR-International Journal of Technology, Vol. 1, No. 1, December 2011.
[7] Hezal Lopes, Rahul Lopes, Comparative Analysis of Mobile Security Threats and Solution, International Journal of Engineering Research and Applications, Vol. 3, Issue 5, pp. 499-502, Sep-Oct 2013.
[8] Vijay Kumar, Mobile Database Systems, Wiley-Interscience, John Wiley & Sons, Inc., New Jersey, 2006.
[9] D. Roselin Selvarani, Dr. T.N. Ravi, A Survey on Data and Transaction Management in Mobile Databases, International Journal of Database Management Systems (IJDMS), Vol. 4, No. 5, pp. 1-20, October 2012, ISSN: 0975-5705, 0975-5985. DOI: 10.5121/ijdms.2012.4501.
[10] D. Roselin Selvarani and Dr. T. N. Ravi, A Review on the Role of Encryption in Mobile Database Security, International Journal of Application or Innovation in Engineering & Management (IJAIEM), Volume 3, Issue 12, pp. 76-83, December 2014, ISSN 2319-4847.
[11] Tao Zhang, Shi Xing-jun, Status Quo and Prospect on Mobile Database Security, Telkomnika, Vol. 11, No. 9, pp. 4949-4955, September 2013.
[12] Faith M. Heikkila, Encryption: Security Considerations for Portable Media Devices, IEEE Security and Privacy, 2007.
[13] Luc Bouganim, Philippe Pucheral, Chip-Secured Data Access: Confidential Data on Untrusted Servers, Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002.
[14] Masa-aki Fukase, Tomoaki Sato, Innovative Ubiquitous Cryptography and Sophisticated Implementation, pp. 364-369, ISCIT 2006, IEEE, 2006.
[15] B. Praveen Kumar, P. Ezhumalai, P. Ramesh, Dr. S. Sankara Gomathi, Dr. P. Sakthivel, Improving the Performance of a Scalable Encryption Algorithm (SEA) using FPGA, International Journal of Computer Science and Network Security, Vol. 10, No. 2, February 2010.
[16] The Impact of Mobile Devices on Information Security: A Survey of IT Professionals, Dimensional Research, January 2012. www.dimensionalresearch.com
[17] http://pewinternet.org/Reports/2011/Smartphones/Summary.aspx
[18] https://www.usa.canon.com/CUSA/assets/app/pdf/ISG_Security/CanonQuickPulse_ITWorld_Whitepaper.pdf
size. The following Table 4.1 shows the parameters used in our scenario.

Table 4.1 - Parameters used in our Scenario
PARAMETERS  VALUES

Table 4.3 - Different Packet Size
BOOKS
[15] Industrial Wireless Sensor Networks: Applications, Protocols, and Standards, V. Cagri Gungor & Gerhard P. Hancke.
Do not send sensitive business files to personal email addresses.
Have suspicious/malicious activity reported to security personnel immediately.
Secure all mobile devices when traveling, and report lost or stolen items to technical support for remote kill/deactivation.
Educate employees about phishing attacks and how to report fraudulent activity.

1.3 How Computers Work
Hardware: All of a computer's physical components, including the mouse, keyboard, screen, and printer, as well as internal parts like the processor and hard drive.
Operating system: Creates the connection between the computer's hardware and the application software employed by the user. Common operating systems include Microsoft Windows, Mac OS and Linux.
Software: A set of instructions that cause the computer to perform certain tasks; can be divided into two types: system software and application software.
Browsers: Programs that look through content published on the Internet and display Internet pages. The most commonly used browsers are Microsoft Internet Explorer and Mozilla Firefox.

2. VULNERABILITIES
Malware: The name given to malicious software that operates under the guise of a useful software program. It runs computer processes that are either unexpected or unauthorized but always harmful [14]. The term malware generally covers viruses, worms and Trojans.
Viruses: Software with the ability to self-replicate and attach itself to other executable programs. The behavior is comparable to its biological counterpart. Computer viruses can also be contagious (might spread on or even beyond the infected computer), exhibit symptoms (the presence of malicious code and its magnitude) and involve a recovery period with possible long-term effects (difficulty of removal and loss of data).
Worm: An autonomous program, or constellation of programs, that distributes fully functional copies of itself, whole or in part, to other computers. Worms are specialists in spreading and reproducing. They consistently exploit all known vulnerabilities, including people, to penetrate barriers that seem impenetrable to normal viruses. A worm does not have a payload of its own but is often used as a transport mechanism for viruses that ride piggyback and immediately start their work.
Grayware: Applications that cause annoying behavior in the way programs run. Unlike malware, grayware does not fall into the category of major threats and is not detrimental to basic system operations.
Spyware: Software installed under misleading premises that monitors and collects a user's data and eventually transmits it to a company for various purposes. This typically happens in the background, that is, the activity is invisible to most users.
Phishing: A method of stealing personal data whereby an authentic-looking e-mail is made to appear as if it is coming from a real company or institution [15]. The idea is to trick the recipient into sending secret information, such as account information or login data, to the scammer. (2009, Bruce S. Schaeffer, Henfree Chan, Henry Chan, and Susan Ogulnick)
Dialers: Dialing programs used to dial up an Internet connection using preset and typically overpriced phone numbers.
Backdoor: An application or service that permits remote access to an infected computer. It opens up a so-called backdoor to circumvent other security mechanisms.
Adware: Software that displays banner ads or pop-ups when a computer is in use. The presence of adware is likely if dubious offers are displayed as pop-ups or banner ads even when you are visiting a reputable website and have a pop-up blocker enabled. Even though adware is not classified as harmful malware, many users regard it as irritating and intrusive. Adware can often have undesired effects on a system, even interrupting the Internet connection or system operations.
Trojans: From the Greek legend of the Trojan horse. In the world of computers, it refers to covert infiltration by malicious software under the guise of a useful program. After a Trojan is activated, it is often very difficult to discover the extent of the damage and, in general, to identify the malware. The Trojan may change its original name and reactivate every time the PC is started.

2.1 Protection
Anti-virus software: Software that detects and removes viruses.
Firewalls: A personal firewall is a program that works on a PC as a protective filter for data communication in a potentially dangerous network such as the Internet.

3. ENCRYPTION TECHNOLOGY
Another factor that can complicate the investigation of cybercrime is encryption technology, which protects information from access by unauthorized people and is a key technical solution in the fight against cybercrime. Encryption is a technique of turning plain text into an obscured format by using an algorithm. Like anonymity, encryption is not new, but computer technology has transformed the field. For a long time it was subject to secrecy; in an interconnected environment, such secrecy is difficult to maintain.

The widespread availability of easy-to-use software tools and the integration of encryption technology into operating systems now make it possible to encrypt computer data with the click of a mouse, and thereby increase the chance of law-enforcement agencies being confronted with encrypted material. Various software products are available that enable users to protect files against unauthorized access, but it is uncertain to what extent offenders already use encryption technology to mask their activities. One survey on child pornography suggested that only 6 per cent of arrested child-pornography possessors used encryption technology, but experts highlight the threat of an increasing use of encryption technology in cybercrime cases. There are different technical strategies to recover encrypted data, and several software tools are available to automate these processes. Strategies range from analyzing weaknesses in the software tools used to encrypt files, searching for encryption passphrases and trying typical passwords, to complex and lengthy brute-force attacks.

The term brute-force attack describes the process of identifying a code by testing every possible combination. Depending on the encryption technique and key size, this process could take decades. For example, if an offender uses encryption software with a 20-bit key, the size of the key space is around one million. Using a current computer processing one million operations per second, the encryption could be broken in about one second. However, if offenders use a 40-bit key, it could take up to two weeks to break the encryption.

In 2002, for example, the Wall Street Journal was able to successfully decrypt files found on an Al Qaeda
efficient. This is possible in the case of tracing a data packet within the boundary of the ISP. In the Internet as such, it is impractical to trace back the illegitimate data packets.

4. PROPOSED METHODOLOGY
Recently, a defense scheme based on distributed detection and automated on-line attack characterization has been proposed [4]. The proposed scheme consists of the following three phases:
- Detect the onset of an attack and identify the victim by monitoring four key traffic statistics of each protected target, while keeping minimal per-target state.
- Differentiate between legitimate and attacking packets destined towards the victim based on a readily computed, Bayesian-theoretic metric of each packet, called the Conditional Legitimate Probability (CLP).
- Discard packets selectively by comparing the CLP of each packet with a dynamic threshold. The threshold is adjusted according to (1) the distribution of all suspicious packets and (2) the congestion level of the victim.

One of the key concepts in PacketScore is the notion of Conditional Legitimate Probability (CLP), based on Bayes' theorem. CLP indicates the likelihood of a packet being legitimate by comparing its attribute values with the values in a baseline profile. Packets are selectively discarded by comparing the CLP of each packet with a dynamic threshold. The concept of using a baseline profile with Bayes' theorem has been used previously in anomaly-based IDS (Intrusion Detection System) applications [5], where the goal is generally attack detection rather than real-time packet filtering.

In this research, the basic concept is extended to a practical real-time packet filtering scheme using elaborate processes. The PacketScore operations for single-point protection are described, but the fundamental concept can be extended to a distributed implementation for core routers. The concept of Conditional Legitimate Probability (CLP) is described, with a focus on the profiling of legitimate traffic characteristics. Score assignment to packets, selective discarding, and overload control are described; the performance of the standalone packet filtering scheme is evaluated; some of the important issues related to the PacketScore scheme are discussed; and, finally, directions for future investigation are given.

4.1. Scoring Packets
Packets are scored using statistics via the Conditional Legitimate Probability, based on Bayes' theorem. The concept of using a baseline profile with Bayes' theorem has been used previously in anomaly-based IDS (Intrusion Detection System) applications, where the goal is generally attack detection rather than real-time packet filtering.

4.2 Conditional Legitimate Probability
The most challenging issue in blocking DDoS attacks is to distinguish attack packets from legitimate ones. To resolve this problem, we utilize the concept of Conditional Legitimate Probability (CLP) for identifying attack packets probabilistically. The CLP is produced by comparing traffic characteristics during the attack with previously measured, legitimate traffic characteristics. The viability of this approach is based on the premise that there are some traffic characteristics that are inherently stable during normal network operations of a target network.

To formalize the concept of CLP, consider all the packets destined for a DDoS attack target. Each packet carries a set of discrete-valued attributes A, B, C, .... For example, A might be the protocol type, B might be the packet size, C might be the TTL value, etc. We define {a1, a2, a3, ...} as the possible values for attribute A, {b1, b2, b3, ...} as the possible values for attribute B, and so on. During an attack there are Nn legitimate packets and Na attack packets arriving in T seconds, totaling Nm:

Nm = Nn + Na

(m for measured, n for normal, and a for attack)

Let Cn(A = ai) be the number of legitimate packets with value ai for attribute A. Then:

Nn = Cn(A = a1) + Cn(A = a2) + ... + Cn(A = ai) + ...
   = Cn(B = b1) + Cn(B = b2) + ... + Cn(B = bi) + ...
Na = Ca(A = a1) + Ca(A = a2) + ... + Ca(A = ai) + ...
   = Ca(B = b1) + Ca(B = b2) + ... + Ca(B = bi) + ...
Nm = Cm(A = a1) + Cm(A = a2) + ... + Cm(A = ai) + ...
   = Cm(B = b1) + Cm(B = b2) + ... + Cm(B = bi) + ...

Let Pn denote the ratio, or probability, of attribute values among the legitimate packets. The Conditional Legitimate Probability (CLP) is defined as the probability of a packet being legitimate given its attributes:

CLP(packet p) = P(packet p is legitimate | p's attribute A = ap, attribute B = bp, ...).

According to Bayes' theorem, the conditional probability of an event E given an event F is defined as:

P(E | F) = P(E ∩ F) / P(F)

Therefore, CLP can be rewritten as follows:

CLP(p) = [Nn × Pn(A = ap, B = bp, ...) / Nm] / [Nm × Pm(A = ap, B = bp, ...) / Nm]    (1)

Assuming the attributes are independent, (1) can be further rewritten as:

CLP(p) = [Nn × Pn(A = ap) × Pn(B = bp) × ...] / [Nm × Pm(A = ap) × Pm(B = bp) × ...]    (2)

One of the terms in (2), e.g. Pn(A = ap) / Pm(A = ap), is called a partial score.
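Equation (2)'s product of partial scores is a few lines of code once the baseline and measured profiles are kept as per-attribute value counts. A minimal sketch (the attribute names and counts below are invented for illustration; in the scheme itself, the baseline profile comes from previously measured legitimate traffic):

```python
from collections import Counter

def clp(packet, nominal_profile, measured_profile, Nn, Nm):
    """Conditional Legitimate Probability per Eq. (2): Nn/Nm times the
    product of per-attribute partial scores Pn(X = x) / Pm(X = x)."""
    score = Nn / Nm
    for attr, value in packet.items():
        pn = nominal_profile[attr].get(value, 0) / Nn    # Pn(X = x)
        pm = measured_profile[attr].get(value, 0) / Nm   # Pm(X = x)
        if pm == 0:
            return 0.0  # value never seen during the attack window
        score *= pn / pm
    return score

# Baseline (legitimate) profile: value counts over Nn = 1000 packets.
Nn = 1000
nominal = {"proto": Counter({"TCP": 900, "UDP": 100}),
           "ttl":   Counter({64: 800, 128: 200})}

# Profile measured during an attack over Nm = 5000 packets (UDP flood mixed in).
Nm = 5000
measured = {"proto": Counter({"TCP": 1000, "UDP": 4000}),
            "ttl":   Counter({64: 900, 128: 4100})}

legit_like  = clp({"proto": "TCP", "ttl": 64},  nominal, measured, Nn, Nm)
attack_like = clp({"proto": "UDP", "ttl": 128}, nominal, measured, Nn, Nm)
```

Packets whose attribute mix matches the baseline (here TCP with TTL 64) receive a much higher score than packets matching the flood profile, which is what the selective-discard stage exploits.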
profiles in Fig. 1b. However, in Fig. 1d, when compared with other sites, the SL is much lower, showing a much weaker correlation.

scorebooks increases. On the other hand, generating a log-version scorebook may take longer than regular scorebook generation. However, the scorebook is generated only once at the end of each period, and it is not necessary to observe every packet for scorebook generation; thus, some processing delay can be allowed. Furthermore, scorebook generation can be easily parallelized using two processing lines, which allows complete sampling without missing a packet. The multiplication and division operations in (2) can be converted into subtraction/addition operations with the help of the logarithmic version of (2), as shown below:

  log CLP(p) = [log N_n + log P_n(A = a_p) + log P_n(B = b_p) + ...] - [log N_m + log P_m(A = a_p) + log P_m(B = b_p) + ...]
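A scorebook built from these log partial scores can be sketched as follows. Per-packet scoring then reduces to table lookups and additions. The data structures and function names are illustrative assumptions, not the paper's code.

```python
import math
from collections import Counter

def build_log_scorebook(nominal_hist, current_hist, n_nominal, n_current):
    """Precompute log partial scores once per measurement period.
    Histograms map attribute name -> Counter of value frequencies."""
    book = {}
    for attr in nominal_hist:
        book[attr] = {}
        for v in set(nominal_hist[attr]) | set(current_hist[attr]):
            p_n = nominal_hist[attr][v] / n_nominal
            p_m = current_hist[attr][v] / n_current
            if p_n > 0 and p_m > 0:
                # log P_n(A = v) - log P_m(A = v)
                book[attr][v] = math.log(p_n) - math.log(p_m)
    prior = math.log(n_nominal) - math.log(n_current)   # log N_n - log N_m
    return prior, book

def log_score(packet, prior, book):
    # per-packet work: one lookup and one addition per attribute
    return prior + sum(book[a].get(v, 0.0) for a, v in packet.items())
```

Because the expensive logarithms are taken once per period during scorebook generation, the per-packet path stays cheap enough for line-rate operation.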
Table 6. Performance against Different Types of Attack.

The output utilization ρ_out was compared against the target utilization ρ_target. Fig. 4 shows a typical score distribution trend over 25 periods under a DDoS attack. The attack packets (the high peaks on the upper left side) are concentrated in the lower score region, while legitimate packets (lower right side) have higher scores. The black bar between the two groups represents the cutoff threshold score for the discard decision, which removes the majority of the attack traffic. PacketScore tracks the score distribution change in each period and adjusts the cutoff threshold accordingly.
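The cutoff selection just described can be sketched as picking a percentile of the observed score distribution so that roughly the target fraction of traffic is kept. This is a simplified illustration of the load-shedding step, with invented scores; the paper's actual controller adjusts the threshold from measured load.

```python
def cutoff_threshold(scores, target_fraction):
    """Pick the discard threshold so that approximately `target_fraction`
    of the observed traffic (the highest-scoring packets) is accepted."""
    ranked = sorted(scores)                         # ascending CLP scores
    drop = int(len(ranked) * (1.0 - target_fraction))
    return ranked[min(drop, len(ranked) - 1)]       # scores below this are discarded

# Toy score distribution: low scores are attack-like, high scores legitimate.
scores = [0.05, 0.08, 0.1, 0.4, 1.2, 1.5, 2.0, 2.2, 2.5, 3.0]
thr = cutoff_threshold(scores, 0.6)                 # keep ~60% of packets
kept = [s for s in scores if s >= thr]
```

Recomputing the threshold every measurement period is what lets the filter follow a changing attack mix, as the text above describes.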
Fig 4: Score distribution over time

6.2 Performance under Different Attack Types
The results are described in Table 5. In most cases, R_A and R_L were above 99 percent and false positives were kept low. The result was substantially better than random packet dropping, of which the false positive ratio is expected to be 90.7 percent, and better than some previous methods. Furthermore, ρ_out was successfully kept close to its target value in most cases. The false negative ratios were mainly due to the gap between ρ_target and ρ_legitimate. Although the signature of TCP-SYN flood packets can easily be derived by any filtering scheme, the ability of PacketScore to prioritize legitimate TCP-SYN packets over attack packets based on other packet attributes is an essential feature. Without such prioritization, PacketScore did show some degradation under nominal attack when the TTL values were randomized. Legitimate packets having the same characteristics as attack packets were penalized, but such chances were still quite small, and the false positive ratio was kept to 3.30 percent. When the TTL values were fixed, the performance was better. The TTL value 118 accounted for the largest portion of traffic (29.86 percent), and 100 had a ratio under the adaptive threshold (0.29 percent) for 99 percent coverage. As PacketScore utilizes more attributes, the performance should become better. Only one joint attribute was used in this experiment; numerous combinations of attributes are possible, and the performance under nominal attack can be further improved. It would also be interesting to see how the scores of each attack type are distributed.

However, PacketScore is designed to accept specific traffic only up to the maximum ratio that was observed in the past. Therefore, DDoS traffic beyond this ratio will be properly filtered by PacketScore and thus cannot succeed as a massive attack. Nevertheless, accepting any DDoS traffic is undesirable, as it wastes bandwidth. This situation can be improved by constructing a cleaner nominal profile that contains less DDoS traffic. A cleaner profile can be made in one of two ways.

First, the packet trace data can be analyzed to identify legitimate flows that show proper two-way communication behavior. The packets from the legitimate flows are then used for constructing the profile. Although some traffic flows that do not have continuous packet exchange, such as ICMP, may be left out, PacketScore filtering is already based on an iceberg-style histogram with default value assignment for non-iceberg items, so the impact on performance of missing some of the packets should be minimal.

Second, a generic profile can first be used to remove those packets that are more likely to be attack packets, and the remaining packets used to create the final nominal profile. The generic profile reflects overall Internet traffic characteristics, e.g., the TCP versus UDP ratio, common packet sizes, common TCP flags, etc. Our preliminary research shows that this two-step profiling is very effective against generic attacks. Further study is needed on these methods.

7. CONCLUSION & FUTURE SCOPE
The PacketScore scheme is used to defend against DDoS attacks. The key concept in PacketScore is the Conditional Legitimate Probability (CLP), produced by comparing legitimate traffic and attack traffic characteristics, which indicates the likelihood that a packet is legitimate. As a result, packets following the legitimate traffic profile have higher scores, while attack packets have lower scores. This scheme can tackle never-before-seen DDoS attack types by providing a statistics-based adaptive differentiation between attack and legitimate packets to drive selective packet discarding and overload control at high speed.

PacketScore is capable of blocking all kinds of attacks as long as the attackers do not precisely mimic the site's traffic characteristics. The performance and design tradeoffs of the proposed packet scoring scheme are studied in the context of a stand-alone implementation. By exploiting the measurement/scorebook generation process, an attacker may try to mislead PacketScore by changing the attack types and/or intensities. Such an attempt can easily be overcome by using a
smaller measurement period to track the attack traffic pattern more closely. We are currently investigating a generalized implementation of PacketScore for core networks.

PacketScore is suitable for operation in the core network at high speed, and we are working on an enhanced scheme for core network operation in a distributed manner. In particular, we plan to investigate the effects of update and feedback delays in a distributed implementation, and to implement the scheme in hardware using network processors. Second, PacketScore is designed to work best against large-volume attacks and does not work well with low-volume attacks. PacketScore performance in the presence of such attack types, e.g., bandwidth-soaking attacks [11] or low-rate attacks [12], can be explored and improved. Finally, a thorough investigation of the stability of traffic characteristics shall be performed, which ensures the objective of the proposed research work.

8. REFERENCES
[1]. D. Moore, G.M. Voelker, and S. Savage, Inferring Internet Denial-of-Service Activity, Proc. 10th USENIX Security Symp., Aug. 2001.
[2]. H. Wang, D. Zhang, and K.G. Shin, Change-Point Monitoring for the Detection of DoS Attacks, IEEE Trans. Dependable and Secure Computing, vol. 1, no. 4, Oct.-Dec. 2004.
[3]. S. Savage, D. Wetherall, A. Karlin, and T. Anderson, Network Support for IP Traceback, IEEE/ACM Trans. Networking, vol. 9, no. 3, June 2001.
[4]. W.C. Lau, M.C. Chuah, H.J. Chao, and Y. Kim, PacketScore: A Proactive Defense Scheme against Distributed Denial of Service Attacks, NSF proposal under submission.
[5]. M. Mahoney and P.K. Chan, Learning Nonstationary Models of Normal Network Traffic for Detecting Novel Attacks, Proc. ACM SIGKDD 2002, pp. 376-385, 2002.
[6]. J. Jung, B. Krishnamurthy, and M. Rabinovich, Flash Crowds and Denial of Service Attacks: Characterization and Implications for CDNs and Web Sites, Proc. Int'l World Wide Web Conf., May 2002.
[7]. D. Liu and F. Huebner, Application Profiling of IP Traffic, Proc. 27th Ann. IEEE Conf. Local Computer Networks (LCN), 2002.
[8]. D. Marchette, A Statistical Method for Profiling Network Traffic, Proc. First USENIX Workshop on Intrusion Detection and Network Monitoring, Apr. 1999.
[9]. NLANR PMA Packet Trace Data, http://pma.nlanr.net/Traces, 2006.
[10]. S. Kasera et al., Fast and Robust Signaling Overload Control, Proc. Int'l Conf. Network Protocols, Nov. 2001.
[11]. Y. Xu and R. Guerin, On the Robustness of Router-Based Denial-of-Service (DoS) Defense Systems, ACM SIGCOMM Computer Comm. Rev., vol. 35, no. 3, July 2005.
[12]. Y. Kim, W.C. Lau, M.C. Chuah, and H.J. Chao, PacketScore: Statistics-Based Overload Control against Distributed Denial-of-Service Attacks, Proc. IEEE INFOCOM, Mar. 2004.
ABSTRACT
A mobile ad hoc network is a collection of wireless mobile nodes dynamically forming a network without any existing infrastructure, in which the relative positions of nodes dictate the communication links. Ad hoc networks are wireless networks where nodes communicate with each other using multi-hop links. Due to nodal mobility, the network topology may change rapidly and unpredictably over time. Cluster-based secure structures have evolved as an important research topic in MANETs. This paper discusses various clustering methods and proposes a secured cluster to improve the performance of large MANETs.

Keywords
MANET, Cluster, Scalable, attacks, Mobility, Power aware.

1 INTRODUCTION
1.1 MANET
The network is an autonomous transitory association of mobile nodes that communicate with each other over wireless links. Nodes that lie within each other's send range can communicate directly and are responsible for dynamically discovering each other. To enable communication between nodes that are not directly within each other's send range, intermediate nodes act as routers that relay packets generated by other nodes to their destination. These nodes are often energy constrained, that is, battery-powered devices with a great diversity in their capabilities. Furthermore, devices are free to join or leave the network and they may move randomly, possibly resulting in rapid and unpredictable topology changes. In this energy-constrained, dynamic, distributed multi-hop environment, nodes need to organize themselves dynamically in order to provide the necessary network functionality in the absence of fixed infrastructure or central administration. Compared with the wired network, the mobile ad hoc network needs a more robust security scheme to ensure its security.

1.3 Types of attacks
The nature and structure of networks in a MANET make it attractive to various types of attackers. Different types of attackers attempt different approaches to decrease the network performance and throughput. On the basis of the nature of the attack, attacks against MANETs may be classified into legitimate-based and interaction-based attacks.

Legitimate-based attacks are divided into: i) external attacks, committed by nodes that are not legal members of the network; and ii) internal attacks, which come from a compromised member inside the network and are not easy to prevent or detect. Interaction-based attacks are again classified into:

i) Passive attack:
Passive attacks are attacks that do not disrupt the proper operation of the network.
Attackers snoop on data exchanged in the network without altering it.
The requirement of confidentiality can be violated if an attacker is also able to interpret the data gathered through snooping.
Detection of these attacks is difficult, since the operation of the network itself does not get affected.

ii) Active attack:
Active attacks are severe attacks performed by attackers for replicating, modifying and deleting exchanged data.
They try to change the behavior of the protocols.
These attacks are meant to degrade or prevent the message flow among nodes.
we need to keep them secret from all entities that do not have the privilege to access them. Confidentiality is sometimes called secrecy or privacy.

1.4.3 Integrity
Integrity means that assets can be modified only by authorized parties or only in authorized ways. Integrity assures that a message being transferred is never corrupted.

1.4.4 Authentication
Authentication enables a node to ensure the identity of the peer node it is communicating with. It is essentially an assurance that participants in communication are authentic and not impersonators. Authenticity is ensured because only the legitimate sender can produce a message that will decrypt properly with the shared key.

1.4.5 Nonrepudiation
Nonrepudiation ensures that the sender and receiver of a message cannot disavow that they have ever sent or received such a message. This is helpful when we need to determine whether a node with some undesired function is compromised or not.

1.4.6 Anonymity
Anonymity means that all information that can be used to identify the owner or current user of a node should by default be kept private and not be distributed by the node itself or the system software.

1.5 Cluster
The concept of a cluster involves taking two or more computers and organizing them to work together to provide higher availability, reliability and scalability than can be obtained by using a single system. The mobile nodes in a MANET are divided into different virtual groups; geographically adjacent nodes are allocated into the same cluster according to some rules, with different behaviors for nodes included in a cluster than for those excluded from it.

1.6 Advantages of clustering
First, a cluster structure facilitates the spatial reuse of resources to increase the system capacity. A cluster can better coordinate its transmission events with the help of a special mobile node, such as a cluster head, residing in it. This can save many resources otherwise used for retransmissions by reducing transmission collisions. The second benefit is in routing: the set of cluster heads and cluster gateways can normally form a virtual backbone for inter-cluster routing, and thus the generation and spreading of routing information can be restricted to this set of nodes. Last, a cluster structure makes an ad hoc network appear smaller and more stable in the view of each mobile terminal. When a mobile node changes its attaching cluster, only mobile nodes residing in the corresponding clusters need to update their information.

This paper is organized as follows: Section 2 reviews the literature, Section 3 discusses the proposed method, Section 4 concludes the paper, and finally Section 5 lists the references.

2. REVIEW OF LITERATURE
Mobility-aware clustering [1] indicates that the cluster architecture is determined by the mobility behavior of mobile nodes. The idea is that, by grouping mobile terminals with similar speed into the same cluster, the intra-cluster links can become more tightly connected, and the re-affiliation and re-clustering rates can be naturally decreased.

In energy-efficient clustering [2], battery power is considered very important for the mobile nodes in a MANET during operation. Energy limitation is a severe challenge for network performance. Compared to ordinary nodes, a cluster head consumes more power because of its extra work. Any node failure due to energy depletion may cause communication interruption, so it is important to balance the energy consumption among mobile nodes to avoid node failure, especially when some mobile nodes bear special tasks or the network density is comparatively sparse.

The load-balancing clustering algorithm [3] is used to specify the optimum number of mobile nodes that a cluster can handle, especially in a cluster-head-based MANET. System throughput will be reduced if a too-large cluster puts too heavy a load on the cluster head. A too-small cluster, however, may produce a large number of clusters and thus increase the length of hierarchical routes, resulting in longer end-to-end delay. The load-balancing clustering algorithm limits the number of mobile nodes that a cluster can effectively handle. When a cluster size exceeds its limit, re-formation of the clustering procedure is invoked to adjust the number of mobile nodes in that cluster.

In Lowest-ID clustering [4], each node is assigned a unique ID and broadcasts the ID to all its neighbor nodes. The node with the lowest ID is selected as cluster head. A node which can communicate with more than one cluster is known as a gateway.

The Mobility-based D-hop Clustering Algorithm (MobDHop) [5] dynamically forms variable-diameter clusters that are adaptive to node mobility patterns. It allows cluster members to be at most d hops away from their cluster heads, in order to control the cluster size and manage cluster head density. Cluster head election depends on three mobility metrics: (a) Variation of estimated Distance between nodes over time (VD), which estimates the relative mobility of two nodes; (b) Local Variability (LV), which calculates the mean of the VD of all the neighbors of a node; and (c) Group Variability (GV), which calculates the mean of the LV of its 1-hop neighbors. A node that has the lowest variability value (LV) assumes the role of cluster head.

In the trust-based clustering [6] scheme, each node evaluates the trust value of its neighbor nodes and recommends the neighbor with the highest trust value as its trust guarantor. The recommender node then becomes a member of the CH node which is one hop away, and the recommender has to recommend other nodes again. When nodes recommend a CH, they give the CH recommendation certificates called R-Certificates. These certificates are used to authenticate the CH, so a CH which holds many recommendation certificates is considered a more trustable node in the ad hoc network.
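The Lowest-ID election of [4] is concrete enough to sketch in code. The following is a simplified sequential sketch under the assumption that every node knows its one-hop neighbourhood (represented here as an adjacency map; the representation and the greedy processing order are illustrative, not from [4]).

```python
def lowest_id_clusters(adjacency):
    """Lowest-ID clustering sketch: the lowest uncovered ID becomes a
    cluster head and its neighbours join it; a non-head node whose
    neighbourhood spans more than one cluster acts as a gateway.
    `adjacency` maps node ID -> set of neighbour IDs."""
    heads, membership = [], {}
    for node in sorted(adjacency):            # lowest IDs claim headship first
        if node not in membership:            # not yet covered by any head
            heads.append(node)
            membership[node] = node
            for nbr in adjacency[node]:
                membership.setdefault(nbr, node)   # join this cluster head
    gateways = {n for n, nbrs in adjacency.items()
                if n not in heads
                and len({membership[m] for m in nbrs | {n}}) > 1}
    return heads, membership, gateways

# Toy 6-node topology: three clusters emerge, nodes 2 and 5 become gateways.
heads, membership, gateways = lowest_id_clusters(
    {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2}, 4: {2, 5}, 5: {4, 6}, 6: {5}})
assert heads == [1, 4, 6] and gateways == {2, 5}
```

The same adjacency map could drive the other ID- and degree-based schemes reviewed here by swapping the head-selection criterion.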
In the recommendation certificate, the period is the term of validity of the certification.

The weighted cluster algorithm (WCA) proposed for MANETs [7] elects the CH based on factors like mobility, the ability to handle nodes, communication range, and so forth. The algorithm calculates the average weight of each node based on the provided factors. The node with the minimum weight is selected as cluster head.

The goal of the Highest-Degree algorithm [8] is to minimize the number of clusters, which is achieved as follows. Each node is aware of the number of its neighbor nodes, acquired by interactively exchanging control messages. The node having the highest number of neighbors, i.e. the highest degree, is elected as the cluster head.

The Secured Clustering Algorithm for Mobile Ad Hoc Networks [9] is a secured weight-based clustering algorithm allowing more effectiveness, protection and trust in the management of cluster size variation. It includes security requirements by using a trust value defining how much a node is trusted by its neighborhood, and by using a certificate as the node's identifier to avoid possible attacks.

SEEC, the Signal and Energy Efficient Clustering algorithm [10], prevents re-selection of the cluster head until energy and signal fall to a threshold value; cluster head re-formation is done when the energy and signal reach the threshold value. The node which has the highest energy level and signal level is elected as the cluster head.

The above algorithms are analyzed and compared in Table 1. From this study, several factors that are not considered were identified, and a new method which can handle all those factors is given below.

3. PROPOSED METHOD
The proposed method suggests an algorithm to create a clustered architecture which concentrates on security. It is used to identify malicious nodes and avoid attacks.

3.1 Algorithm

3.1.1 Cluster formation phase
Step 1: Generate an ID for every newly entered node using a random number generator (nodes can enter or exit at any time, so the scheme supports scalable clusters).
Step 2: The weight of each node is calculated based on battery power, stability, memory capacity, energy and degree.
Step 3: Set the maximum and minimum load for the cluster.
Step 4: If the number of nodes exceeds the limits, create a new cluster.
The ID of the node is used to maintain authentication.

3.1.2 Cluster Head (CH) selection phase
Step 1: Set the node with the highest weight as cluster head.

3.1.3 Re-election phase
Step 1: Check the battery power of the CH against a threshold at regular time intervals.
Step 2: If it is less, the CH sends a battery-low signal to its neighbor nodes.
Step 3: Re-calculate the global weight and select the next CH.

3.1.4 Mobility calculation phase
Step 1: Calculate each node's hop count from the CH between a particular time interval t and t-1.

3.1.5 Reputation Index phase (Integrity)
Step 1: Initialize a Reputation Index (RI) value for each node.
Step 2: Increment the RI value of every upstream or relay node if it receives an acknowledgement receipt of packets.
Step 3: Decrement the RI value if no acknowledgement is received from the downstream node or it floods the path.
Step 4: A node with a negative RI value is excluded from the next route discovery process.
Step 5: Consider a node with a negative RI value as a selfish node or an attacker.

4. CONCLUSION
Security has become a primary concern due to MANET characteristics such as dynamic topology, lack of infrastructure, absence of centralized authority, mobility and scalability. This paper discusses the issues in MANETs, compares various clustering schemes and proposes a secured, scalable, energy-efficient and mobility-based clustering algorithm for MANETs.

5. REFERENCES
[1]. P. Basu, N. Khan, and T. D. C. Little, A Mobility Based Metric for Clustering in Mobile Ad Hoc Networks, Proc. IEEE ICDCSW'01, Apr. 2001, pp. 413-418.
[2]. J.-H. Ryu, S. Song, and D.-H. Cho, New Clustering Schemes for Energy Conservation in Two-Tiered Mobile Ad-Hoc Networks, Proc. IEEE ICC'01, vol. 3, June 2001, pp. 862-866.
[3]. A. D. Amis and R. Prakash, Load-Balancing Clusters in Wireless Ad Hoc Networks, Proc. 3rd IEEE ASSET'00, Mar. 2000, pp. 25-32.
[4]. C. R. Lin and M. Gerla, Adaptive Clustering for Mobile Wireless Networks, IEEE Journal on Selected Areas in Communications, pp. 1265-1275, 1997.
[5]. I. Er and W. Seah, Mobility-Based D-Hop Clustering Algorithm for Mobile Ad Hoc Networks, Proc. IEEE WCNC 2004, vol. 4, 2004, pp. 2359-2364.
[6]. Pushpita Chatterjee, Indranil Sengupta, and S. K. Ghosh, A Trust Based Clustering Framework for Securing Ad Hoc Networks, Information Systems, Technology and Management, Communications in Computer and Information Science, vol. 31.
[7]. M. Chatterjee, S. Das, and D. Turgut, WCA: A Weighted Clustering Algorithm for Mobile Ad Hoc Networks, Cluster Computing, pp. 193-204, Kluwer Academic, 2002.
[8]. M. Gerla and J. T. Tsai, Multicluster, Mobile, Multimedia Radio Network, Wireless Networks, vol. 1, pp. 255-265, Oct. 1995.
[9]. Yao Yu and Lincong Zhang, A Secure Clustering Algorithm in Mobile Ad Hoc Networks, IPCSIT, vol. 29, 2012.
[10]. Alak Roy, Manasi Hazarika, and Mrinal Kanti Debbarma, Energy Efficient Cluster Based Routing in MANET, Proc. International Conference on Communication, Information & Computing Technology (ICCICT), Oct. 19-20, 2012.
[11]. M. Anandhi and T. N. Ravi, An Appraisal of Attacks in MANET and Fortification Methods, IJARCS, vol. 5, no. 8, Nov.-Dec. 2014, ISSN 0976-5697.
Table 1. Comparison of clustering schemes

2. Highest degree. CH criterion: node with the most neighbors. Advantage: fewer hops needed to satisfy any request. Drawback: topology changes occur due to mobility.
3. Weighted cluster algorithm. CH criterion: node with minimum weight. Advantage: maximizes the stability of the cluster topology. Drawback: complexity in cluster maintenance.
4. Trust-based clustering. CH criterion: node with the highest recommendation certificate. Advantage: authenticates the valid node. Drawback: mobility and power/energy are not considered.
5. Signal and energy efficient. CH criterion: nodes with higher energy and signal level. Advantage: pays attention to CH election; avoids re-election. Drawback: complexity in maintenance.
6. Load balancing. CH criterion: node with optimal load. Advantage: each cluster handles a limited number of nodes. Drawback: scalability not supported; high maintenance.
7. Mobility-based d-hop clustering. CH criterion: most stable node. Advantage: makes the cluster diameter more flexible; considers cluster maintenance during its third phase. Drawback: assumes each node can measure its received signal strength depending upon the closeness between two nodes.
1.1 CRYPTOGRAPHY
Cryptography is a method of storing and transmitting data in a particular form so that only those for whom it is intended can read and process it.

1.1.1 Plain Text
The original message that a person wishes to communicate to the other party is defined as plain text. In cryptography, the actual message that has to be sent to the other end is given the special name plain text.

1.1.2 Cipher Text
The message that cannot be understood by anyone, i.e. the meaningless message, is what we call cipher text. In cryptography, the original message is transformed into a non-readable message before the transmission of the actual message.

1.1.3 Encryption
The process of converting plain text into cipher text is called encryption. Cryptography uses the encryption technique to send confidential messages through an insecure channel. The process of encryption requires two things: an encryption

Fig 1: Encryption and Decryption

2. PURPOSE OF CRYPTOGRAPHY
2.1 Confidentiality
When we talk about confidentiality of information, we are talking about protecting the information from disclosure to unauthorized parties. Information has value, especially in today's world. A key component of protecting information confidentiality is encryption. Encryption ensures that only the right people can read the information. Information in a computer is transmitted and has to be accessed only by the authorized party and not by anyone else.

2.2 Integrity
Only the authorized party is allowed to modify the transmitted information. No one between the sender and receiver is allowed to alter the given message. Integrity of information refers to protecting information from being modified by unauthorized parties. Information only has value if it is
2.3 Non-repudiation
Non-repudiation ensures that neither the sender nor the receiver of a message can deny the transmission. Non-repudiation is the assurance that someone cannot deny something. Typically, non-repudiation refers to the ability to ensure that a party to a contract or a communication cannot deny the authenticity of their signature on a document, or the sending of a message that they originated.

2.4 Authentication
Authentication is the process of proving one's identity. The information received by any system has to be checked against the identity of the sender: whether the information is arriving from an authorized person or from a false identity.

2.5 Access Control
Only the authorized parties are able to access the given information.

3. TYPES OF CRYPTOGRAPHY
There are different types of cryptography. There is a sender, a receiver, an intruder on the information, and a cryptographic tool that prevents the intruder from trespassing on the sensitive information.

Symmetric Key Cryptography or Secret Key Cryptography
Asymmetric Key Cryptography or Public Key Cryptography
Hash Functions
Key Escrow Cryptography
Translucent Cryptography

3.1 Symmetric Key Cryptography
In symmetric cryptography, the key used for encryption is the same as the key used for decryption. The sender and recipient of the data must share the same key and keep it secret, preventing data access from outside. Thus the key distribution has to be made prior to the transmission of information. The key plays a very important role in symmetric cryptography, since security directly depends on the nature of the key, i.e. the key length, etc. The bigger difficulty with this approach is the distribution of the key.

Secret key cryptography schemes are generally categorized as being either stream ciphers or block ciphers. Stream ciphers operate on a single bit at a time and implement some form of feedback mechanism so that the key is constantly changing. A block cipher is so called because the scheme encrypts one block of data at a time using the same key on each block. In general, the same plaintext block will always encrypt to the same ciphertext when using the same key in a block cipher, whereas the same plaintext will encrypt to different ciphertext in a stream cipher. There are various symmetric key algorithms such as DES, TRIPLE DES, AES, RC4, RC6, and BLOWFISH.

Fig 2: Symmetric Key Cryptography

A symmetric cryptosystem is faster. In symmetric cryptosystems, encrypted data can be transferred on the link even if there is a possibility that the data will be intercepted. Since no key is transmitted with the data, the chances of the data being decrypted are null. A symmetric cryptosystem uses password authentication to prove the receiver's identity: only a system which possesses the secret key can decrypt a message.

3.1.1 Limitations of Symmetric Key Cryptography
Symmetric cryptosystems have a problem of key transportation. The secret key has to be transmitted to the receiving system before the actual message is transmitted. Every means of electronic communication is insecure, as it is impossible to guarantee that no one will be able to tap communication channels; so the only secure way of exchanging keys would be exchanging them personally.
They cannot provide digital signatures that cannot be repudiated.
A drawback of a private key encryption system is that it requires anyone new to gain access to the key. This access may require transmitting the key over an insecure method of communication.

3.2 Public Key Cryptography
It involves a pair of keys: one for encryption and another for decryption. The key used for encryption is a public key and is distributed; the key used for decryption is a private key. Unlike secret key cryptography, keys are not shared. Instead, each individual has two keys: a private key that need not be revealed to anyone, and a public key that is preferably known to the entire world. We will use the letter e to refer to the public key, since the public key is used when encrypting a message, and the letter d to refer to the private key, because the private key is used to decrypt a message. Encryption and decryption are two mathematical functions that are inverses of each other.
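The inverse relationship between the public exponent e and the private exponent d can be shown with a toy RSA example. The tiny primes below are for illustration only and would be hopelessly insecure in practice; this is a sketch of the mathematics, not of any production scheme.

```python
# Toy RSA: demonstrates that encrypting with the public key (e, n)
# and decrypting with the private key (d, n) are inverse operations.
p, q = 61, 53
n = p * q                    # modulus, part of both keys
phi = (p - 1) * (q - 1)      # Euler totient of n
e = 17                       # public exponent (coprime to phi)
d = pow(e, -1, phi)          # private exponent: modular inverse of e
                             # (three-argument pow, Python 3.8+)

m = 42                       # a message encoded as a number < n
c = pow(m, e, n)             # encrypt with the public key
assert pow(c, d, n) == m     # decrypt with the private key: message recovered
```

Because only d can undo what e did, the public key can be published freely, which is exactly the key-distribution advantage described above.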
In asymmetric or public key cryptography there is no need for exchanging keys, thus eliminating the key distribution problem. The primary advantage of public-key cryptography is increased security: the private keys never need to be transmitted or revealed to anyone. It can also provide digital signatures that cannot be repudiated.

3.2.1 Limitation of Public Key Cryptography
A disadvantage of using public-key cryptography for encryption is speed: there are popular secret-key encryption methods which are significantly faster than any currently available public-key encryption method.

3.3. Hash Function

In translucent cryptography, the government can decrypt some of the messages, but not all. Only a fraction p of the messages can be decrypted, and the remaining 1-p cannot. This is advantageous over key escrow and no-key-escrow cryptography, as the entire information is not at security risk.

3.5.1 Limitation of Translucent Cryptography
Law enforcement may be frustrated that, even when it has an authorized wiretap, it does not obtain decryption of all of the messages.
Individuals may be frustrated that this scheme does not provide absolute privacy for their messages, since law enforcement can read some fraction of them.
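The hash functions introduced in Section 3.3 map input of arbitrary length to a fixed-length, one-way digest. A minimal sketch with Python's standard hashlib:

```python
import hashlib

# A hash function produces a fixed-length, one-way digest of its input.
d1 = hashlib.sha256(b"abc").hexdigest()
d2 = hashlib.sha256(b"abd").hexdigest()  # change a single input character

assert len(d1) == len(d2) == 64  # SHA-256 digests are always 256 bits
assert d1 != d2                  # a tiny input change alters the whole digest
```

The digest length is independent of the input length, and recovering the input from the digest is computationally infeasible, which is what makes hash functions useful for integrity checks.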
5. CONCLUSION
In this research, different types of cryptography, such as symmetric key cryptography, asymmetric key cryptography, hash functions, key escrow cryptography and translucent cryptography, are analysed along with their limitations and applications. Public-key cryptography provides increased security and convenience: private keys never need to be transmitted or revealed to anyone. In a secret-key system, by contrast, the secret keys must be transmitted, and there is a chance that an enemy discovers the secret keys during their transmission. Hash tables can be more difficult to implement than self-balancing binary search trees. In key escrow cryptography a third party is involved in encryption; translucent cryptography overcomes key escrow cryptography, but it also has some limitations. By removing these limitations, these algorithms can be used in various fields of our life for safe data transfer.
[Figure: line, word and character segmentation results (a. Capital Alphabets, d. Numbers)]
[Figure: segmentation results (a. Capital Alphabets, b. Small Alphabets)]
5. Conclusion
The proposed image segmentation methods have been tested on a number of document images and handwritten images, using a set of quantitative evaluation measurements for the image segmentation. The system is designed in such a way that the text in a document image is detected and segmented automatically. Line segmentation is done using horizontal projection profile and vertical projection profile analysis.
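The horizontal-projection-profile line segmentation described above can be sketched as follows. The threshold value and the binary-image convention (1 for text pixels, 0 for background) are assumptions of this sketch.

```python
import numpy as np

def segment_lines(binary_img, threshold=0):
    """Line segmentation via horizontal projection profile: consecutive
    rows whose text-pixel count exceeds `threshold` form one text line.
    Assumes a binary image with 1 for text pixels and 0 for background."""
    profile = binary_img.sum(axis=1)       # horizontal projection profile
    lines, start = [], None
    for y, count in enumerate(profile):
        if count > threshold and start is None:
            start = y                      # a text line begins
        elif count <= threshold and start is not None:
            lines.append((start, y))       # the text line ends
            start = None
    if start is not None:                  # line touching the bottom edge
        lines.append((start, len(profile)))
    return lines

# Two synthetic text lines at rows 1-2 and 6-7:
img = np.zeros((10, 5))
img[1:3, :] = 1
img[6:8, 1:4] = 1
bands = segment_lines(img)
```

Applying the same idea to `binary_img.sum(axis=0)` gives the vertical projection profile used for word and character segmentation.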
ABSTRACT
Due to the rapid development of information technology, online business has emerged and product sales are accomplished through various websites. The recommender system helps users to select valuable products. In such systems, sparsity is the major problem. In this paper, we propose a methodology to preprocess the data: two methods are used to impute missing values, and finally we select the best method based on time complexity.

General Terms
Imputation, Row elimination, Column elimination

Keywords
Collaborative Filtering (CF), Singular Value Decomposition (SVD), Principal Component Analysis (PCA).

1. INTRODUCTION
Information Technology is the application of computers and telecommunications equipment to transmit and manipulate data, often in the context of a business or other enterprise.
Data Mining (DM) is an analytic process designed to extract useful information from large amounts of data. It includes various techniques such as preprocessing, supervised learning methods and unsupervised learning methods.
A recommender system is a system that provides suggestions to users about what product to choose. Such a system is otherwise called a recommendation system. There are mainly three types of recommender systems: content-based, collaborative filtering (CF) and hybrid systems. The collaborative filtering system can be categorized into two types: user-based and item-based.
In a user-based CF system, the ratings given by the users are represented as an (M x N) user-item matrix whose elements are the ratings given by M users for the corresponding N items. Usually the user-item matrix is a sparse matrix. This paper is organized as follows. Section 2 discusses the related works in detail. Section 3 discusses the data format, row elimination, column elimination and imputation of missing values. Section 4 discusses the imputation methods in detail and analyzes them. Section 5 presents the pseudo code for the preprocessing technique. Section 6 discusses the experimental results. Section 7 concludes the analysis. Section 8 specifies the future enhancement.

2. RELATED WORKS
Nilashi, M. et al. [2] presented the basic concepts such as data format, CF techniques, CF tasks such as prediction and recommendation, similarity measures, and the evaluation metrics used in CF systems.
Billsus, D. et al. [3] specified that since the correlation coefficient is only defined between customers who have rated at least two products in common, many pairs of customers have no correlation at all. Su et al. [4] used PMM (predictive mean matching, a state-of-the-art imputation technique) to impute the missing values.
Ghazanfar, M. A. et al. [5] proposed nearly 17 approaches to approximate the missing values in the user-item rating matrix. The main approaches are based on various classifiers such as k-nearest neighbor, Naive Bayes and support vector machines.
Raiko, T. et al. [6] proposed a gradient descent algorithm inspired by Oja's rule, and sped it up by an approximate Newton's method. Noha, H. et al. [7] proposed multiple-imputation-based collaborative filtering to solve the sparsity problem.
Chen, Y. et al. [8] solved the sparsity problem in recommender systems by retrieving associations between users.
Kumar, B. D. et al. [9] explained the role of matrix factorization methods such as SVD, PCA and probabilistic matrix factorization in recommender systems.

3. METHODOLOGY
3.1 Preprocessing Framework
Preprocessing is one of the important data mining tasks. As the movie dataset is a sparse dataset, it is better to preprocess the data. The proposed preprocessing technique involves three steps, as shown in Figure 1.

Fig 1: Steps involved in Preprocessing Technique
3.2 Row and Column Elimination
Row elimination is a process used to eliminate the users who have rated fewer than 100 movies out of a total of 1682 movies. Column elimination is a process used to eliminate the movies that have been rated by fewer than 50 users out of a total of 943 users.

3.3 Imputation of Missing Values
The movie dataset consists of 1,486,126 (93.7%) missing values. Using such a dataset for further data mining techniques such as classification, clustering and association rule mining leads to poor results. So, to improve the accuracy, we impute the missing values. The missing values can be filled by various methods such as Singular Value Decomposition and Principal Component Analysis.

3.4 IMPUTATION METHODS
3.4.1 Singular Value Decomposition (SVD)
Usually, researchers use SVD for dimensionality reduction, but SVD can also be used to impute the missing values in the dataset of a CF system. The new data matrix X contains most of the information of the original X, with imputed values.

3.5 ALGORITHM FOR PREPROCESSING TECHNIQUE
The algorithm for the preprocessing technique used in this paper is as follows:

Input: Dataset with missing values
Output: Dataset with imputed values
Procedure:
Step 1: Eliminate rows with ratings < 100
Step 2: Eliminate columns with ratings < 50
Step 3: Perform Singular Value Decomposition to impute the missing values and solve the sparsity problem
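Steps 1-3 of the preprocessing algorithm can be sketched in NumPy as follows. The NaN encoding of missing ratings and the rank of the SVD reconstruction are assumptions of this sketch, and the elimination thresholds are parameters.

```python
import numpy as np

def preprocess(R, min_user_ratings=100, min_item_ratings=50, rank=2):
    """Sketch of Steps 1-3: row elimination, column elimination, then
    SVD-based imputation. Missing ratings are encoded as np.nan."""
    R = R[(~np.isnan(R)).sum(axis=1) >= min_user_ratings, :]   # Step 1
    R = R[:, (~np.isnan(R)).sum(axis=0) >= min_item_ratings]   # Step 2
    # Step 3: seed missing cells with item means, reconstruct with a
    # low-rank SVD, and copy the reconstruction into the missing cells.
    missing = np.isnan(R)
    filled = R.copy()
    col_mean = np.nanmean(R, axis=0)
    filled[missing] = np.take(col_mean, np.where(missing)[1])
    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    approx = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
    return np.where(missing, approx, R)

# Tiny demo with lowered thresholds (the real dataset uses 100 and 50):
R = np.array([[5.0, np.nan, 3.0],
              [4.0, 2.0, np.nan],
              [np.nan, np.nan, np.nan]])
dense = preprocess(R, min_user_ratings=1, min_item_ratings=1, rank=1)
```

Observed ratings are kept exactly as given; only the missing cells receive the low-rank reconstruction.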
4.3 Result of Imputation Using SVD
The results of the imputation methods SVD and PCA are given in Table 3.

Table 3. Imputation result
Imputation Method | Time Taken (msec) | Remarks
SVD               | 0.69              | Efficient
PCA               | 130.60            | Bad in time taken; increases the number of columns

Among the two imputation methods considered, SVD and PCA, the Singular Value Decomposition method is the best because it takes less time and neither reduces nor increases the dimension. In this paper, we concentrated on the sparsity problem in the CF system. The sparsity has been solved using the preprocessing technique, as shown in Figure 3.

Fig 3: Levels of Sparsity

5. CONCLUSION
In this paper we discussed the preprocessing framework, which consists of three tasks: row elimination, column elimination and imputation of missing values. We also compared two imputation methods, SVD and PCA. We selected SVD as the best method to impute missing values. We conclude that the sparsity problem in the CF system can be solved by using this framework.

6. FUTURE ENHANCEMENT
In future, this preprocessed movie dataset can be used for prediction of ratings for new users and recommendation of movies. This will lead to good accuracy of prediction and recommendation. The preprocessing technique can also be applied to other datasets such as jokes, books, etc.

7. REFERENCES
[1] http://www.movielens.org
[2] Nilashi, M., Bagherifard, K., Ibrahim, O., Alizadeh, H., Nojeem, L. A., and Roozegar, N. 2013. Collaborative Filtering Recommender Systems. Research Journal of Applied Sciences, Engineering and Technology, vol. 5, no. 16, pp. 4168-4182.
[3] Billsus, D., and Pazzani, M. J. 1998. Learning Collaborative Information Filters. In Proceedings of Recommender Systems Workshop. Tech. Report WS-98-08, AAAI Press.
[4] Su, X., Khoshgoftaar, T. M., Zhu, X., and Greiner, R. 2008, March. Imputation-boosted collaborative filtering using machine learning classifiers. In Proceedings of the 2008 ACM Symposium on Applied Computing (pp. 949-950). ACM.
[5] Ghazanfar, M. A., and Prügel-Bennett, A. 2013. The Advantage of Careful Imputation Sources in Sparse Data-Environment of Recommender Systems: Generating Improved SVD-based Recommendations. Informatica (Slovenia), 37(1), 61-92.
[6] Raiko, T., Ilin, A., and Karhunen, J. 2008, January. Principal component analysis for sparse high-dimensional data. In Neural Information Processing (pp. 566-575). Springer Berlin Heidelberg.
[7] Noha, H., Kwak, M., and Hanc, I. 2003. Handling incomplete data problem in collaborative filtering system. Journal of Intelligent Information Systems, 9(2), 51.
[8] Chen, Y., Wu, C., Xie, M., and Guo, X. 2011. Solving the Sparsity Problem in Recommender Systems Using Association Retrieval. Journal of Computers, 6(9), 1896-1902.
[9] Kumar Bokde, D., Girase, S., and Mukhopadhyay, D. 2014. Role of Matrix Factorization Model in Collaborative Filtering Algorithm: A Survey. International Journal of Advance Foundation and Research in Computer (IJAFRC), Volume 1, Issue 12, pp. 2348-4853.
[10] Pearson, K. 1901. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11), 559-572.
ABSTRACT
Data mining is the non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data. Data mining is defined as the automatic extraction of unknown, useful and understandable patterns from large databases. Utilizing these massive volumes of data is becoming a major problem for all enterprises. Raw data is highly susceptible to noise, missing values, and inconsistency. The quality of data affects the data mining results. In order to help improve the quality of the data and, consequently, of the mining results, raw data is pre-processed so as to improve efficiency. Noise is a random error or variance in a measured variable. Noise is removed using data smoothing techniques. This paper provides a study on data smoothing techniques to remove noise using the binning method in data mining.

Keywords
Data Mining Techniques, Data pre-processing, Discretization, Binning method

1. INTRODUCTION
The development of Information Technology has generated a large number of databases and huge data in various areas. Research in databases and information technology has given rise to an approach to store and manipulate this precious data for further decision making. Data mining is a process of extraction of useful information and patterns from huge data [3]. It is also called the knowledge discovery process, knowledge mining from data, knowledge extraction or data/pattern analysis.

Real-world data tends to be highly noisy, missing, and inconsistent due to its huge size and multiple sources. Low-quality data will lead to low-quality mining results. So, we prefer a preprocessing concept:

Fig 2. Forms of Data Pre-processing

3. TASKS IN DATA PREPROCESSING
3.1 Data Cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies. Data cleaning tasks are:
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data
- Resolve redundancy caused by data integration
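Smoothing noisy data by binning, one of the cleaning tasks above, can be sketched as follows. The bin count is an illustrative parameter, and the price data mirrors the well-known worked example in Han and Kamber [4].

```python
def smooth_by_bin_means(values, n_bins=3):
    """Equal-frequency (equi-depth) binning followed by smoothing by bin
    means: each value is replaced by the mean of its bin."""
    data = sorted(values)
    size = len(data) // n_bins
    smoothed = []
    for i in range(n_bins):
        lo = i * size
        hi = (i + 1) * size if i < n_bins - 1 else len(data)
        bin_vals = data[lo:hi]
        mean = sum(bin_vals) / len(bin_vals)     # smooth by the bin mean
        smoothed.extend([mean] * len(bin_vals))
    return smoothed

# Sorted price data partitioned into 3 equi-depth bins:
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices)
```

Each bin of three prices collapses to its mean, so small random fluctuations within a bin are removed; smoothing by bin medians or bin boundaries works analogously.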
7. CONCLUSION
Data preparation is an important issue for both data warehousing and data mining, as real-world data tends to be incomplete, noisy, and inconsistent. Data preparation includes data cleaning, data integration, data transformation, and data reduction. Data cleaning routines can be used to fill in missing values, smooth noisy data, identify outliers, and correct data inconsistencies. A lot of methods have been developed, but this is still an active area of research.
8. REFERENCES
[1]. CRISP-DM 1.0: Step-by-step Data Mining Guide, from http://www.crisp-dm.org/CRISPWP-0800.pdf.
[2]. Gregory Piatetsky-Shapiro, KDnuggets, http://www.kdnuggest.com/data mining concepts/
[3]. Gupta, G. K., Introduction to Data Mining with Case Studies.
[4]. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2nd Ed., 2006.
[5]. Liu, H. and Setiono, R., Feature Selection and Discretization, IEEE Transactions on Knowledge and Data Engineering, 9:1-4, 1997.
[6]. Pyle, D., Data Preparation for Data Mining, Morgan Kaufmann, 1999.
[7]. Richard J. Roiger and Michael W. Geatz, Data Mining: A Tutorial-based Primer.
[8]. Tan, Pang-Ning, Steinbach, M., and Vipin Kumar, Introduction to Data Mining, Pearson Education, New Delhi, ISBN: 978-81-317-1472-0, 3rd Edition, 2009.
pixels which have the value 1 and those which have the value 0, respectively, as shown in Figure 3.2.

Algorithm: II
Input: Target image
Output: Top n relevant images
// Top n images retrieval //
Step 1: Select target image Ti of size () from the image database.
Step 2: Apply step 1 through step 5 of Algorithm I.
Step 3: Compute the Euclidean distance between the target image and each image in the IDB using Equation (7).
Step 4: List the top n relevant images.
Step 5: Stop.
dE(Fs(Ii), Fs(It)) = sqrt( Σ_{i=1}^{n} ( Fs(Ii) − Fs(It) )^2 )    --- (7)
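Equation (7) and the top-n retrieval of Algorithm II can be sketched as follows. The dictionary-based image database and the precomputed feature vectors are assumptions of this sketch; the feature extraction Fs itself is out of scope.

```python
import math

def euclidean(f_target, f_image):
    # d_E(Fs(Ii), Fs(It)) = sqrt( sum_i (Fs(Ii) - Fs(It))^2 )  -- Equation (7)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f_target, f_image)))

def top_n(f_target, idb, n):
    # Rank every image in the database by distance to the target image
    # and return the names of the n closest ones (Algorithm II, steps 3-4).
    ranked = sorted(idb, key=lambda name: euclidean(f_target, idb[name]))
    return ranked[:n]

# Hypothetical 2-D feature vectors for three database images:
idb = {"img_a": [0.0, 0.0], "img_b": [3.0, 4.0], "img_c": [1.0, 1.0]}
closest = top_n([0.0, 0.0], idb, 2)
```

Sorting by distance and truncating to n entries is exactly the "list top n relevant images" step; for large databases an index structure would replace the full sort.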
1. INTRODUCTION
In recent times, many projects have initiated research to investigate ad-hoc networks as a communication technology for vehicle-specific applications, within the wider field of intelligent transportation systems (ITS). ITSs are computerized systems that have diverse applications and are connected with vehicle transportation. These computerized systems are made up of computers, communications, sensor and control technologies, and management strategies [1].
1.3 Vehicle Adhoc Network Architectures
VANETs are made up of vehicles (which are equipped with on-board units) and road-side infrastructure units (RSUs). The architecture of a VANET is split into three domains: the In-Vehicle Domain, the Ad-Hoc Domain and the Infrastructure Domain. As can be seen in the diagram above, the In-Vehicle Domain consists of an OBU and many AUs (Application Units). The AUs are user devices, for example mobile phones and PDAs, that perform certain functions when interacting with the OBU. The Ad-Hoc Domain consists of OBUs in vehicles and RSUs, which are along the roadside.

1.4 Vehicle Adhoc Network Applications
Papadimitratos et al. [5] maintain that VANETs are developed as a means to enhance transportation safety and efficiency. Safety applications are just one of the applications for VANETs, and are discussed below, together with other applications:
- Safety Applications
- Automated Highways
- IP Based Applications
- High Mobility
- Location Awareness
- Security

2. PROBLEM DESCRIPTION
Detecting Sybil attacks in urban vehicular networks, however, is very challenging. First, vehicles are anonymous: there are no chains of trust linking claimed identities to real vehicles. Second, location privacy of vehicles is of great concern, as location information of vehicles can be very confidential.

2.1 Objectives
2.1.1 Main Objectives
- To detect multiple adversaries using the same identity
- To localize the hacker node
- To perform cluster-based victim node detection in the network
- To detect the presence of spoofing attacks
- To determine the number of attackers
- To localize multiple adversaries and eliminate them
- To create a mobile node network
- To make mobile movement (random walk) within a given speed
- To update location information to neighbors
- To update location information of all nodes to the Base Station
- To simulate a replicated-node attack
- To make the Base Station identify the mobile replication attack

2.1.2 Specific Objectives
- A fast and effective mobile replica node detection scheme is proposed using the Sequential Probability Ratio Test.
- To tackle the problem of spoofing attacks in mobile sensor networks.
- The scheme detects mobile replicas in an efficient and robust manner at the cost of reasonable overheads.
- To determine the number of attackers when multiple adversaries are masquerading as the same node identity.

The architecture of the system model consists of three interactive components:
1. Road Side Unit (RSU)
2. On-Board Units (OBUs)
3. Trust Authority (TA)

Road Side Unit (RSU): The RSU can be deployed at intersections or any area of interest (e.g., bus stations and parking lot entrances). A typical RSU also functions as a wireless AP (e.g., IEEE 802.11x) which provides wireless access to users within its coverage.

On-Board Units (OBUs): The OBUs are installed on vehicles. A typical OBU can be equipped with a cheap GPS receiver and a short-range wireless communication module (e.g., DSRC IEEE 802.11p [18]).

Trust Authority (TA): The TA is responsible for the system initialization and RSU management. The TA is also connected to the RSU backbone network.

Deployment of Road Side Units (RSUs): In Footprint, vehicles require authorized messages issued from RSUs to form trajectories; the RSUs should be statically installed as the infrastructure.

2.1.3 Drawbacks
The existing system has the following disadvantages:
- The failed-RSU scenario is not covered in the existing system.
- If an RSU fails at a given time, the trajectory created at that event will not contain that RSU's information along the trajectory.
- A later trajectory containing this RSU (prepared after the RSU is online again) will therefore look distinct.
- Before comparing two trajectories for similarity, the trajectory information should be modified such that the missed RSU is added to the trajectory.

2.2 Proposed System and Its Advantages
The proposed system covers the entire existing system approach. In addition, RSU failure is also taken into consideration. A trajectory created at an event with a missed RSU is modified before Sybil attack detection, so misinterpretation of an honest vehicle as a suspicious node is avoided. The length of the messages increases linearly with the trajectory length.
Reducing the message size is not covered in the existing system, so the repeated occurrences of adjacent RSUs are eliminated in the proposed system. For example, a trajectory {R1, R4, R5, R6, R4, R5, R7} can be modified to {R1, R4, R5, R6, R7}, and the length of the trajectory is reduced by two. To implement this, the proposed system covers Infrastructure Construction, Generating Location-Hidden Trajectory (trajectory reconstruction with a missed RSU) and Sybil attack detection.

2.3 Advantages
1. SHOW NETWORK
In this module, a typical vehicular network (with RSUs installed) is shown graphically.
2. ADD ROAD SIDE UNIT
In this module, Road Side Unit details are added and saved to the RSU table. The Sybil attack will be detected using the proposed approach.
3. UPDATE ROAD SIDE UNIT FAILURE
In this module, failures occurring in Road Side Units are updated and saved to the RSU table. The RSU's Active status will be set to 0. The status will be announced to all other RSUs by the Trusted Agent.
4. ADD NEIGHBOR ROAD SIDE UNIT
In this module, the Road Side Unit id, Neighbor Road Side Unit id and distance details are added and saved to the NeighborRSUs table. This distance information assists the Sybil attack detection.
5. VIEW ROAD SIDE UNITS
In this module, Road Side Unit details are fetched from the RSU table. The records are displayed using a data grid view control.
6. VIEW NEIGHBOR ROAD SIDE UNITS
In this module, Road Side Unit and Neighbor RSU details are fetched from the NeighborRSUs table. The records are displayed using a data grid view control.
7. MESSAGE
The message module is used to update the messages between a road side unit and a vehicle's on-board unit.
8. FIND SYBIL ATTACK
The find-Sybil-attack module is used to detect unauthorized vehicles in the network.
9. COMMUNICATION
In the partial signature creation, the input is provided as a key pair, namely the private key of the road side unit and the public key of the vehicle, together with the message; the message is then encrypted and the partial signature value is computed in the application.

4. RESULT AND DISCUSSION
In this section, we examine the performance of Footprint in recognizing forged trajectories (issued by malicious vehicles) and actual ones (provided by honest vehicles) through trace-driven simulations. We consider two key metrics:

4.1 Key Metrics of Footprint
In the Sybil attack detection scheme, it is possible that a trajectory of an honest vehicle could be mixed with other trajectories (either malicious Sybil trajectories or other honest ones), especially when the length of the trajectory is short. This will cause false alarms of Sybil trajectories. The issue can be largely mitigated by comparing multiple sets of trajectories issued in different events.
1. False positive error: the proportion of all actual trajectories that are incorrectly identified as forged trajectories.
2. False negative error: the proportion of all forged trajectories that are falsely identified as actual trajectories.

5. CONCLUSION
A Sybil attack detection scheme named Footprint is developed for urban vehicular networks. Consecutive authorized messages obtained by an anonymous vehicle from RSUs form a trajectory to identify the corresponding vehicle. Location privacy of vehicles is preserved by realizing a location-hidden signature scheme. Utilizing the social relationship among trajectories, Footprint can find and eliminate Sybil trajectories. The Footprint design can be incrementally
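The repeated-RSU elimination illustrated by the {R1, R4, R5, R6, R4, R5, R7} example in this section can be sketched as follows. Representing a trajectory as a list of RSU identifiers is an assumption of this sketch.

```python
def compress_trajectory(trajectory):
    # Drop any RSU that already appeared earlier in the trajectory,
    # preserving the order of first occurrences.
    seen, compressed = set(), []
    for rsu in trajectory:
        if rsu not in seen:
            seen.add(rsu)
            compressed.append(rsu)
    return compressed

traj = ["R1", "R4", "R5", "R6", "R4", "R5", "R7"]
short = compress_trajectory(traj)
saved = len(traj) - len(short)   # trajectory length reduced by two
```

The compressed trajectory matches the paper's example, and the message size shrinks by the number of eliminated repeats.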
102 http://sites.google.com/site/ijcsis/
ISSN 1947-5500
(IJCSIS) International Journal of Computer Science and Information Security
National Conference on Research Issues in Image Analysis & Mining Intelligence (NCRIIAMI-2015)
6. REFERENCES
1. Maen M Artimy, William Robertson, and William J
Phillips, Algorithms and Protocols for Wireless and
Mobile Ad Hoc Networks, 1st ed., Azzedine Boukerche,
Ed.: John Wiley & Sons, 2009.
2. Sudipto Das, Security Issues in MANETs,
www.cs.ucsb.edu /~sudipto/ talks/ Security.pps.
3. P Papadimitratos, G Calandriello, J.-P Hubaux, and A
Lioy, "Impact of Vehicular Communications Security on
Transportation Safety," in IEEE INFOCOM Workshop,
Lausanne, 2008, pp. 1-6.
4. A Boukerche, Algorithms and Protocols for Wireless,
Mobile Ad Hoc Networks , 1st ed.: Wiley-IEEE Press , 2009.
5. Many, VANET Vehicular Applications and Inter-
Networking Technologies, 1st ed., Hannes Hartenstein and
Kenneth P. Laberteaux, Eds. West Sussex, United
Kingdom: John Wiley and Sons Ltd, 2010.
6. Yunpeng Wang, Zhenguo Yi, Daxin Tian, and Haiying
Xia, "Safety Message Transmitting Method for Vehicle
Infrastructure Integration," in 6th Advanced Forum on
Transportation of China, Beijing, 2010.
7. Umesh Sehgal, Kuljeet Kaur, and Pawan Kumar, "Security
in Vehicular Ad-hoc Networks," in Second International
Conference on Computer and Electrical Engineering,
2009, pp. 485 - 488.
8. I.A Sumra, H Hasbullah, I Ahmad, and J.-L bin Ab
Manan, "Forming Vehicular Web of Trust in VANET," in
Saudi International Electronics, Communications and
Photonics Conference, 2011.
1. INTRODUCTION
The process of analyzing data from different perspectives and summarizing it into useful information is known as data mining. The main functionalities of data mining are characterization and discrimination, the mining of frequent patterns, associations and correlations, classification and regression, clustering analysis, and outlier analysis [2]. Clustering is one of the most interesting and important topics in data mining. The main aim of clustering is to find intrinsic structures in the data and organize them into meaningful subgroups for further study and analysis. Clustering is a data mining tool and it has roots in many application areas such as

Figure 2: Intra and Inter Cluster Distances

Data mining is a multi-step process. It requires accessing and preparing data for a data mining algorithm, mining the data, analyzing results and taking appropriate action. The accessed data can be stored in one or more operational databases, a data
warehouse or a flat file. In data mining, the data is mined using two learning approaches, i.e., supervised learning and unsupervised learning.

1.1 Supervised Learning
Learning in which the class label is provided for each training tuple is known as supervised learning: the training data are analyzed by a classification algorithm. Supervised learning is also called directed data mining. The two groups of variables are explanatory variables and dependent variables. The target of the analysis is to specify a relationship between the explanatory variables and the dependent variable, as is done in regression analysis. To apply directed data mining techniques, the values of the dependent variable must be known for a sufficiently large part of the data set. The training data include both the input and the desired results. These methods are fast and accurate. The correct results are known and are given as inputs to the model during the learning process. Supervised models include neural networks, multilayer perceptrons and decision trees.

1.2 Unsupervised Learning
Learning in which the class label of each training tuple is not known, and the number or set of classes to be learned may not be known in advance, is known as unsupervised learning. In unsupervised learning situations all variables are treated in the same way,

2.1. Well-separated clusters
A cluster is a set of points such that any point in a cluster is nearer (or more similar) to every other point in the cluster than to any point that is not in the cluster.

2.2. Center-based clusters
A cluster is a set of objects such that an object in a cluster is nearer (more similar) to the center of its own cluster than to the center of any other cluster. The center of a cluster is often a centroid.

2.3. Contiguous clusters
A cluster is a set of points such that a point in a cluster is nearer (or more similar) to one or more other points in the cluster than to any point that is not in the cluster.

2.4. Density-based clusters
A cluster is a dense region of points, separated from other high-density regions by low-density regions.

2.5. Shared Property or Conceptual Clusters
there is no distinction between explanatory and dependent It finds clusters that share some common property or
variables. The model is not provided with the correct results represent a particular concept.
during the training. It can be used to cluster the input data in
classes on the basis of their statistical properties only. 3.DATA MINING TASKS
From a machine learning perspective clusters correspond to Data mining tasks[3] are generally divided into two major
hidden patterns, the search for clusters is unsupervised categories:
learning, and the resulting system represents a data concept.
From a practical perspective clustering plays an outstanding 3.1 Predictive task
role in data mining applications such as scientific data The goal of this task is to predict the value of one particular
exploration, information retrieval and text mining, spatial attribute, based on values of other attributes. The attributes
database applications, Web analysis, CRM, marketing, that is used for making the prediction is named as independent
medical diagnostics, computational biology, and many others. variable. The other value which is to be predicted is known as
Presenting data by fewer clusters necessarily loses certain fine the Target or dependent value.
details (loss in data compression), but achieves simplification.
It represents many data objects by few clusters, and hence, it
models data by its clusters. Clustering is often one of the first 3.2 Descriptive task
steps in data mining analysis. It identifies groups of related The purpose of this task is surmise underlying relations in
records that can be used as a starting point for exploring data .In descriptive task of data mining, values are
further relationships. Clustering is a data mining (machine independent in nature and it frequently require post-
learning) technique used to place data elements into related processing to validate results.
groups without advance knowledge of the group definitions.
Clustering techniques fall into a group of undirected data
mining tools. The goal of undirected data mining is to 4. CLASSIFICATION OF CLUSTERING
discover structure in the data as a whole. Clustering is the main task of Data Mining. And it is done by
the number of algorithms. The most commonly used
In general, there are two types of attributes associated with algorithms in Clustering are Hierarchical, Partitioning,
input data in clustering algorithms, i.e., numerical attributes, Density and Grid based algorithms.
and categorical attributes. Numerical attributes are those with
a finite or infinite number of ordered values, such as the 4. 1. PARTITIONING ALGORITHMS
height of a person or the x-coordinate of a point on a 2D In partitioning method, [1] a partitioning algorithm arranges
domain. On the other hand, categorical attributes are those all the objects into various partitions, where total number of
with finite unordered values, such as the occupation or the partitions is less than the total number of objects. Partitioning
blood type of a person. Many different clustering techniques algorithms divide data into several subsets. The reason of
have been defined in order to solve the problem from dividing the data into several subsets is that checking all
different perspective, i.e. partition based clustering, density possible subset systems is computationally not feasible; there
based clustering, hierarchical methods and grid-based are certain greedy heuristics schemes are used in the form of
methods etc iterative optimization. Specifically, this means different
relocation schemes that iteratively reassign points between the
2. GENERAL TYPES OF CLUSTERS k clusters. Relocation algorithms gradually improve clusters.
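The iterative relocation scheme described above is exemplified by k-means, the classic partitioning algorithm. The sketch below is illustrative only: the toy 2-D points, the fixed iteration count, and the helper names are assumptions, not taken from the paper.

```python
import random

def kmeans(points, k, iterations=20):
    """Naive k-means: iteratively reassign points to the nearest
    centroid, then recompute each centroid as its cluster's mean."""
    centroids = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Relocation step: assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i:
                          (p[0] - centroids[i][0]) ** 2 +
                          (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each non-empty centroid to its cluster mean.
        for i, c in enumerate(clusters):
            if c:
                centroids[i] = (sum(p[0] for p in c) / len(c),
                                sum(p[1] for p in c) / len(c))
    return centroids, clusters

# Two visually obvious groups of 2-D points (illustrative data).
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, k=2)
```

Each pass through the loop is one "relocation" in the paper's terminology; the clusters gradually improve until the assignments stop changing.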
4.2 HIERARCHICAL ALGORITHMS
... and also cannot perform object swapping between clusters. Most hierarchical algorithms do not revisit clusters, once constructed, with the purpose of improvement.

In the hierarchical method, clustering is performed on a hierarchical decomposition of the data. The major drawback of this method is that once a step is done, it can never be undone. Multiphase clustering methods such as BIRCH and Chameleon are used to overcome this issue.

4.3 MULTIPHASE CLUSTERING
4.3.1 BIRCH
BIRCH [3] (Balanced Iterative Reducing and Clustering Using Hierarchies) is designed for clustering a large amount of numerical data by integrating hierarchical clustering (at the initial micro-clustering stage) with other clustering methods such as iterative partitioning (at the later macro-clustering stage). It overcomes the two difficulties of agglomerative clustering methods, namely scalability and the inability to undo what was done in the previous step. BIRCH introduces two concepts, the clustering feature (CF) and the clustering feature tree (CF tree), which together represent the objects in clusters as a hierarchical structure. These structures help the clustering method achieve good speed and scalability in large databases and also make it effective for incremental and dynamic clustering of incoming objects. BIRCH examines the data objects two or three times to improve the quality of the built CF tree. The drawback of BIRCH is that it is suitable only for spherical clusters.

4.3.2 Chameleon
Chameleon [9] is a hierarchical clustering algorithm that uses dynamic modelling to determine the similarity between pairs of clusters. In Chameleon, cluster similarity is assessed based on how well-connected objects are within a cluster and on the proximity of clusters. That is, two clusters are merged if their interconnectivity is high and they are close together. Thus, Chameleon does not depend on a static, user-supplied model and can automatically adapt to the internal characteristics of the clusters being merged. The merge process facilitates the discovery of natural and homogeneous clusters and applies to all types of data as long as a similarity function can be specified.

4.4 DENSITY-BASED CLUSTERING
A density-based algorithm continues to grow the given cluster as long as the density in the neighborhood exceeds a certain threshold [6]. This kind of algorithm is suitable for handling noise in the dataset. Its features include the following:
- It handles clusters of arbitrary shape.
- It handles noise.
- It needs only one scan of the input dataset.
- It needs density parameters to be initialized.
DBSCAN, DENCLUE and OPTICS [6] are examples of this kind of algorithm.

4.4.1 DESCRY
... sets having different size and shape. The algorithm's parameters are the agglomerative method used in the pre-clustering step and the similarity metric of interest. DESCRY has a very low computational complexity: it requires O(Nmd) time for high-dimensional data sets and O(N log m) time for low-dimensional data sets, where m can be considered a constant characteristic of the data set. Thus DESCRY scales linearly in both the size and the dimensionality of the data set [9]. Despite its low complexity, the qualitative results are very good and comparable with those obtained by state-of-the-art clustering algorithms. Future work includes, among other topics, the investigation of similarity metrics particularly meaningful in high-dimensional spaces, exploiting summaries extracted from the regions associated to midpoints.

4.4.2 GDBSCAN and its Applications
This paper presents the clustering algorithm GDBSCAN (Density-Based Clustering in Spatial Databases), which generalizes the density-based algorithm DBSCAN in two important ways. GDBSCAN can cluster point objects as well as spatially extended objects, according to both their spatial and their non-spatial attributes. After a review of related work, the general concept of density-connected sets and an algorithm to discover them were introduced [10]. A performance evaluation, analytical as well as experimental, showed the effectiveness and efficiency of GDBSCAN on large spatial databases.

4.5 GRID-BASED ALGORITHMS
4.5.1 STING
STING [11] (STatistical INformation Grid) is a grid-based multi-resolution clustering technique in which the spatial area is divided into rectangular cells. There are usually several levels of such rectangular cells corresponding to different levels of resolution, and these cells form a hierarchical structure: each cell at a high level is partitioned to form a number of cells at the next lower level. Statistical information regarding the attributes in each grid cell (such as the mean, maximum, and minimum values) is pre-computed and stored. Statistical parameters of higher-level cells can easily be computed from the parameters of the lower-level cells. These parameters include the attribute-independent parameter count; the attribute-dependent parameters mean, stdev (standard deviation), min (minimum) and max (maximum); and the type of distribution that the attribute values in the cell follow, such as normal, uniform, exponential, or none (if the distribution is unknown). When the data are loaded into the database, the parameters count, mean, stdev, min, and max of the bottom-level cells are calculated directly from the data. The value of distribution may either be assigned by the user, if the distribution type is known beforehand, or obtained by hypothesis tests such as the χ2 test. The type of distribution of a higher-level cell can be computed based on the majority of distribution types of its corresponding lower-level cells, in conjunction with a threshold filtering process. If the distributions of the lower-level cells disagree with each other and fail the threshold test, the distribution type of the higher-level cell is set to none.
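The bottom-up computation of cell parameters just described can be sketched as follows. The dictionary representation of a cell and the function names are illustrative assumptions, not STING's actual data structures; the point is that a higher-level cell's count, mean, stdev, min and max are derived exactly from its children, without rescanning the raw data.

```python
import math

def cell_stats(values):
    """Bottom-level cell: count, mean, stdev, min, max computed
    directly from the data points falling in the cell."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return {"count": n, "mean": mean, "stdev": math.sqrt(var),
            "min": min(values), "max": max(values)}

def merge_cells(children):
    """Higher-level cell: parameters derived from child cells only.
    Pooling E[X^2] of each child recovers the exact parent variance."""
    n = sum(c["count"] for c in children)
    mean = sum(c["count"] * c["mean"] for c in children) / n
    ex2 = sum(c["count"] * (c["stdev"] ** 2 + c["mean"] ** 2)
              for c in children) / n
    return {"count": n, "mean": mean,
            "stdev": math.sqrt(ex2 - mean ** 2),
            "min": min(c["min"] for c in children),
            "max": max(c["max"] for c in children)}

# Two bottom-level cells merged into one higher-level cell.
left = cell_stats([1.0, 2.0, 3.0])
right = cell_stats([5.0, 7.0])
parent = merge_cells([left, right])
```

The merged statistics are identical to those computed from the union of the raw values, which is what makes STING's single database scan sufficient.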
STING offers several advantages:
- The grid-based computation is query-independent, since the statistical information stored in each cell represents the summary information of the data in the grid cell, independent of the query.
- The grid structure facilitates parallel processing and incremental updating.
- The method's efficiency is a major advantage: STING goes through the database only once to compute the statistical parameters of the cells.

The time complexity of generating clusters is O(n), where n is the total number of objects. After generating the hierarchical structure, the query processing time is O(g), where g is the total number of grid cells at the lowest level, which is usually much smaller than n. Because STING uses a multi-resolution approach to cluster analysis, the quality of STING clustering depends on the granularity of the lowest level of the grid structure. If the granularity is very fine, the cost of processing will increase substantially; however, if the bottom level of the grid structure is too coarse, it may reduce the quality of cluster analysis. As a result, the shapes of the resulting clusters are isothetic; that is, all of the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected. This may lower the quality and accuracy of the clusters despite the fast processing time of the technique.

4.5.2 CLIQUE
CLIQUE [11] (CLustering In QUEst) was the first algorithm proposed for dimension-growth subspace clustering in high-dimensional space. In dimension-growth subspace clustering, the clustering process starts at single-dimensional subspaces and grows upward to higher-dimensional ones. Because CLIQUE partitions each dimension like a grid structure and determines whether a cell is dense based on the number of points it contains, it can also be viewed as an integration of density-based and grid-based clustering methods. However, its overall approach is typical of subspace clustering for high-dimensional space, and so it is introduced in this section.

The ideas of the CLIQUE clustering algorithm are outlined as follows:
- Given a large set of multidimensional data points, the data space is usually not uniformly occupied by the data points. CLIQUE's clustering identifies the sparse and the crowded areas in space (or units), thereby discovering the overall distribution patterns of the data set.
- A unit is dense if the fraction of the total data points contained in it exceeds an input model parameter. In CLIQUE, a cluster is defined as a maximal set of connected dense units.

CLIQUE performs multidimensional clustering in two steps. In the first step, CLIQUE partitions the d-dimensional data space into non-overlapping rectangular units, identifying the dense units among them. This is done in one dimension at a time, for each dimension.

5. CONCLUSION
The overall goal of the data mining process is to extract information from a large data set and transform it into an understandable form for further use. Clustering is important in data analysis and data mining applications. It is the task of grouping a set of objects so that objects in the same group are more similar to each other than to those in other groups (clusters). Clustering can be done by a number of different algorithms, such as hierarchical, partitioning, grid-based and density-based algorithms. Hierarchical clustering is connectivity-based clustering. Partitioning is centroid-based clustering, in which the value of k for k-means is set. Density-based clusters are defined as areas of higher density than the remainder of the data set. Grid-based clustering has the fastest processing time, which typically depends on the size of the grid instead of the data; grid-based methods use a single uniform grid mesh to partition the entire problem domain into cells.

6. REFERENCES
1. Anitha Elavarasi S., Akilandeswari J., Sathiyabhama B., "A Survey on Partition Clustering Algorithms", January 2011.
2. Arockiam L., Baskar S., Jeyasimman L., "Clustering Techniques in Data Mining", 2012.
3. Han J., Kamber M., Data Mining: Concepts and Techniques, 3rd edition, pp. 443-491, 2012.
4. Ilango K., Mohan V., "A Survey of Grid Based Clustering Algorithms", International Journal of Engineering Science and Technology, pp. 3441-3446, 2010.
5. Gholamreza Esfandani, Mohsen Sayyadi, Amin Namadchian, "GDCLU: A New Grid-Density Based CLUstering Algorithm", IEEE 13th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing, pp. 102-107, 2012.
6. Manish Verma, Mauly Srivastava, Neha Chack, Nidhi Gupta, "A Comparative Study of Various Clustering Algorithms in Data Mining", International Journal of Engineering Research and Applications (IJERA), Vol. 2, Issue 3, pp. 1379-1384, 2012.
7. Pragati Shrivastava, Hitesh Gupta, "A Review of Density-Based Clustering in Spatial Data", International Journal of Advanced Computer Research, ISSN (print) 2249-7277, September 2012.
8. Ritu Sharma, Afshar Alam, Anita Rani, "K-Means Clustering in Spatial Data Mining using Weka Interface", International Conference on Advances in Communication and Computing Technologies (ICACACT), proceedings published by International Journal of Computer Applications (IJCA), pp. 26-30, 2012.
9. Umadevi Chezhian, Thanappan Subhash, M. Raghvan, "Hierarchical Sequence Clustering Algorithm for Data Mining", Proceedings of the World Congress on Engineering 2011, Volume III, WCE 2011, July 6-8, London, U.K., 2011.
10. Vijayalakshmi M., Renuka Devi M., "A Survey of Different Issues of Different Clustering Algorithms Used in Large Data Sets", International Journal of Advanced Research in Computer Science and Software Engineering, pp. 305-307, 2012.
R. Rajamani, Assistant Professor, Department of Computer Science, PSG College of Arts & Science, Coimbatore, rajamani_devadoss@yahoo.co.in
M. Rathika, Research Scholar, Department of Computer Science, PSG College of Arts & Science, Coimbatore, rathi2109@gmail.com
... until good accuracy is achieved. The results can be evaluated using SVM with the T-score and SVM with the enrichment score, and the performance accuracy and classification time can be compared with one another. The SVM with the enrichment score performed well, with higher accuracy.

Huang, Y. L. et al. suggested that SVM be used to construct a liver disease diagnosis system for classifying hepatic tumors as benign or malignant. They evaluated 164 liver lesions; of these, 80 were malignant and 84 were hemangiomas. The malignant group of tumors includes primary HCC and metastatic tumors. The results show that a total accuracy rate of 81.7% was obtained. A multi-class SVM classifier based on statistical learning theory was also proposed for automatic classification of liver disease.

Besides the main culprits, hepatitis B and C, alcohol is considered a major contributor to primary liver cancer. Diabetes too is a major risk factor, with diabetics being at least two and a half times more vulnerable to developing chronic liver disease (CLD) that can progress to liver cancer. Non-alcoholic fatty liver disease (NAFLD) is also an emerging concern. Liver cancers should not be confused with liver metastases, also known as secondary liver cancer, which is a cancer that originates in organs elsewhere in the body and migrates to the liver. Liver cancers are formed either from the liver itself or from structures within the liver, including blood vessels or the bile duct.

3.3 Symptoms
The following are the major symptoms of liver cancer:
- A lump below the rib cage on the right side of the abdomen
- Pain near the right shoulder or on the right side of the abdomen
- Jaundice (a disease that causes the skin to yellow)
- Unexplained weight loss
- Fatigue
- Nausea or vomiting
- Loss of appetite
- Dark-colored urine

3.4 Stages of Liver Cancer
Liver cancer stages include the following:

Table 1. Liver Cancer Stages
Stage I: One tumor is found in the liver only.
Stage II: One tumor is found, but it has spread to the blood vessels; OR more than one tumor is present, but they are all smaller than 5 cm.
Stage III: There is more than one tumor larger than 5 cm.
Stage IV: The cancer has spread to other locations in the body.
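The staging rules of Table 1 can be restated as a small hypothetical helper. The parameter names and the decision order are an illustration of the table only, not a clinical tool.

```python
def liver_cancer_stage(num_tumors, largest_cm, spread_to_vessels,
                       distant_spread):
    """Map the findings of Table 1 to a stage label (illustrative only)."""
    if distant_spread:
        # Stage IV: cancer has spread to other locations in the body.
        return "Stage IV"
    if num_tumors > 1 and largest_cm > 5:
        # Stage III: more than one tumor larger than 5 cm.
        return "Stage III"
    if spread_to_vessels or (num_tumors > 1 and largest_cm < 5):
        # Stage II: vascular spread, OR several tumors all under 5 cm.
        return "Stage II"
    # Stage I: a single tumor confined to the liver.
    return "Stage I"
```

For example, a single small tumor with no spread maps to Stage I, while any distant spread dominates the other findings and maps to Stage IV.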
5) Data mining: the crucial step in which clever techniques are applied to extract potentially useful patterns.
6) Pattern evaluation: interesting patterns representing knowledge are identified based on given measures.
7) Knowledge representation: the final phase, in which the discovered knowledge is visually represented to the user.

4.2 Data Mining Concept
Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. Data mining is the extraction of hidden predictive information from large databases. It is a powerful new technology with great potential to analyze important information in the data warehouse. The effect of data mining is the extraction of interesting knowledge from large databases. In short, data mining refers to extracting hidden, previously unknown and potentially useful information from a database, and then offering understandable knowledge, such as association rules and cluster patterns, so as to support users in decision-making. The process of data mining is shown in Figure 2.

[Figure 2: The process of data mining — raw data is cleaned and integrated using filtration, and knowledge is extracted.]

5.1 Rule Set Classifiers
Complex decision trees can be difficult to understand, for instance because information about one class is usually distributed throughout the tree. The aim of classification is to build a classifier based on some cases with some attributes. C4.5 introduced an alternative formalism consisting of a list of rules of the form "if A and B and C and ... then class X", where rules for each class are grouped together. A case is classified by finding the first rule whose conditions are satisfied by the case; if no rule is satisfied, the case is assigned to a default class.

IF conditions THEN conclusion: this kind of rule consists of two parts. The rule antecedent (the IF part) contains one or more conditions on the values of predictor attributes, whereas the rule consequent (the THEN part) contains a prediction about the value of a goal attribute. An accurate prediction of the value of a goal attribute will improve the decision-making process. IF-THEN prediction rules are very popular in data mining; they represent discovered knowledge at a high level of abstraction. In the health care system they can be applied as follows: (Symptoms) (Previous history) (Cause of disease).

Example 1: an if-then rule induced in the diagnosis of liver cancer:
IF Sex = MALE AND EYE_COLOR = Yellow AND Blood_Test = alcohol_content_HIGH THEN Diagnosis = Liver_cancer_High_risk.
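The first-matching-rule scheme with a default class, using Example 1's rule, can be sketched as follows. The rule-list representation and the default class label are illustrative assumptions, not part of the paper.

```python
# Each rule is a (conditions, predicted class) pair. A case is classified
# by the first rule whose conditions it satisfies; otherwise it falls
# through to a default class.
RULES = [
    ({"Sex": "MALE", "EYE_COLOR": "Yellow",
      "Blood_Test": "alcohol_content_HIGH"},
     "Liver_cancer_High_risk"),
]
DEFAULT_CLASS = "Low_risk"  # assumed default; not specified in the paper

def classify(case, rules=RULES, default=DEFAULT_CLASS):
    """Return the class of the first rule whose conditions all match."""
    for conditions, label in rules:
        if all(case.get(attr) == value for attr, value in conditions.items()):
            return label
    return default

patient = {"Sex": "MALE", "EYE_COLOR": "Yellow",
           "Blood_Test": "alcohol_content_HIGH"}
```

A case matching Example 1's antecedent is classified as Liver_cancer_High_risk; any case that satisfies no rule receives the default class.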
Step 1: The inputs are formulated as feature vectors.
Step 2: Using the kernel function, these feature vectors are mapped into a feature space.
Step 3: A division is computed in the feature space to separate the classes of training vectors.

5.6 K-Nearest Neighbor Algorithm
The K-nearest neighbor algorithm (KNN) is one of the supervised learning algorithms that have been used in many applications in the fields of data mining, statistical pattern recognition and many others. It classifies objects based on the closest training examples in the feature space: an object is classified by a majority vote of its neighbors. K is always a positive integer. The neighbors are selected from a set of objects for which the correct classification is known. The K-nearest neighbors algorithm is as follows:
1. Determine the parameter K, i.e. the number of nearest neighbors, beforehand.
2. Calculate the distance between the query instance and all the training samples using any distance measure.
3. Sort the distances for all the training samples and determine the nearest neighbors based on the K-th minimum distance.
4. Since KNN is supervised learning, get all the categories of the training data for the sorted values which fall under K.
5. Measure the prediction value using the majority of the nearest neighbors.

To classify an unclassified vector X, the KNN algorithm ranks the neighbors of X amongst a given set of N data (Xi, Ci), i = 1, 2, ..., N, and uses the class labels Cj (j = 1, 2, ..., K) of the K most similar neighbors to predict the class of the new vector X. The classes of these neighbors are weighted using the similarity between X and each of its neighbors, measured by the Euclidean distance metric. Then X is assigned the class label with the greatest number of votes among the K nearest class labels.

6. CONCLUSION
Data mining plays a very important role in scientific work. Using the enormous data from volumes of case recordings in the database, we can detect liver cancer disease using data mining classification techniques. This paper presents a review of classification techniques for liver cancer. Methods such as rule set classifiers, decision trees, neural networks, K-nearest neighbor and support vector machines (SVM) are discussed. Of the techniques discussed, SVM is the classifier most often used for liver cancer selection and gives higher classification accuracy. In future work, we will try to increase the accuracy of detecting liver cancer disease by incorporating the various parameters suggested by doctors. Better classification techniques can be incorporated, and different ANN transfer functions and SVM kernel functions can be used in future research to improve classifier performance.

7. REFERENCES
[1] World Cancer Report 2014; de Martel C., Ferlay J., Franceschi S., et al., "Global burden of cancers attributable to infections in 2008: a review and synthetic analysis", The Lancet Oncology, 2012; 13: 607-615.
[2] Huilin Xiong and Xue-Wen Chen, "Optimized Kernel Machines for Cancer Classification Using Gene Expression Data", Proceedings of the 2005 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, pp. 1-7, 2005.
[3] Wenjing Zhang, Donglai Ma, Wei Yao, "Medical Diagnosis Data Mining Based on Improved Apriori Algorithm", Journal of Networks, Vol. 9, No. 5, May 2014.
[4] S. Dhamodharan, "Liver Disease Prediction Using Bayesian Classification", An International Journal of Advanced Computer Technology, 2320-0790.
[5] http://ehealth.eletsonline.com/2014/02/50000-people-in-india-are-diagnosed-with-liver-cancer-each-year/#sthash.e3WMWccl
[6] Muhamad Hariz Muhamad Adnan, Wahidah Husain, Nur'Aini Abdul Rashid, "Data Mining for Medical Systems: A Review", Proc. of the International Conference on Advances in Computer and Information Technology (ACIT), 2012.
[7] P. Rajeswari, G. Sophia Reena, "Human Liver Cancer Classification using Microarray Gene Expression Data", International Journal of Computer Applications (0975-8887), Vol. 34, No. 6, Nov 2011.
[8] http://www.cancer.org/cancer/livercancer/detailedguide/liver-cancer-signs-symptoms
[9] Gunasundari, Janairaman, "A Study of Textural Analysis Methods for the Diagnosis of Liver Diseases from Abdominal Computed Tomography", International Journal of Computer Applications (0975-8887), Vol. 74, No. 11, July 2013.
[10] V. Krishnaiah et al., (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 4 (1), 2013, pp. 39-45.
[11] Huang Y. L., Chen J. H., Shen W. C. (2006), "Diagnosis of Hepatic Tumors with Texture Analysis in Nonenhanced Computed Tomography Images", Acad Radiol 2006; 13: 713-720.
[12] Ada et al., International Journal of Advanced Research in Computer Science and Software Engineering, 3(3), March 2013, pp. 131-134.
[13] Arun K. Pujari, Data Mining Techniques, Universities Press, 2001.
[14] J. Han and M. Kamber, 2001. Data Mining: Concepts and Techniques. San Francisco, Morgan Kaufmann Publishers.
[15] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Second Edition.
[16] Shelly Gupta et al., "Data mining classification techniques applied for breast cancer diagnosis and prognosis", Indian Journal of Computer Science and Engineering (IJCSE), Vol. 2, No. 2, Apr-May 2011.
[17] P. Rajeswari, G. Sophia Reena, "Analysis of Liver Disorder Using Data Mining Algorithms", Global Journal of Computer Science and Technology, Vol. 10, Issue 14 (Ver. 1.0), November 2010.
[18] R. Mallika and V. Saravanan, "An SVM Based Classification Method for Cancer Data Using Minimum Microarray Gene Expressions", World Academy of Science, Engineering and Technology, 62, 2010.
[19] N. Revathy and R. Amalraj, "Accurate Cancer Classification Using Expressions of Very Few Genes", International Journal of Computer Applications, Vol. 14, No. 4, 2011.
A machine learning program is successful if and only if it does better than what we call the naive prediction. This brings us to the notion of the statistical significance of the results of our learning programs.
Information Content
Information content is closely related to statistical significance and transparency. A program that learns the theory is not very useful, since the theory does not contain any information.
1.3 KDD Process
Knowledge discovery from databases [6] has as its objective to make knowledge emerge from hidden patterns in large amounts of data. It generally relies on data mining techniques to identify potentially relevant hidden models. The effectiveness of data mining depends on appropriate preparation of the data and interpretation of the results, which are difficult tasks when dealing with heterogeneous, distributed data from multiple sources. The knowledge discovery process has several steps:
- Data selection
- Cleaning
- Enrichment
- Coding
- Data mining
- Reporting
Data mining is then performed on the transformed data to obtain patterns (for example, patterns in magazine sales); by evaluating these patterns, knowledge about the sales is gained. Several levels of analysis are used in the data mining process: query tools, statistical techniques, visualization, OLAP tools, case-based learning, decision trees, association rules, and neural networks.
Another research topic in data mining that was identified as important in [8] is data mining for natural and environmental problems. By developing a data mining ontology we would be able to connect to the natural and environmental domains, where the degree of ontology development is very high. In biology domains, ontologies have been used for different purposes [7]: as controlled vocabularies, for representing encyclopedic knowledge, as the specification of an information model (MAGE-OM, MAGE-ML, the MGED ontology [1]), for the specification of a data interchange format, and for representing the semantics of data for information integration.
1.4 Data Mining Characteristics
Cao and Zhang [2] proposed a methodology called D3M that considers human knowledge and context information on the problem at hand during data mining. The D3M methodology has the following characteristics:
- Context-based restriction
- Domain knowledge integration
- Cooperation between man and machine
- Depth mining
- Improved knowledge discovery
- Interactive result refinement process
- Support for interactive and parallel mining
2. ONTOLOGICAL ENGINEERING
In recent years the development of ontologies (explicit formal specifications of the terms in a domain and the relations among them [4]) has been moving from the realm of Artificial-Intelligence laboratories to the desktops of domain experts. Ontologies have become common on the World-Wide Web. Ontologies on the Web range from large taxonomies categorizing Web sites (such as on Yahoo!) to categorizations of products for sale and their features (such as on Amazon.com). The WWW Consortium (W3C) is developing the Resource Description Framework (RDF), a language for encoding knowledge on Web pages to make it understandable to electronic agents searching for information. The Defense Advanced Research Projects Agency (DARPA), in conjunction with the W3C, is developing the DARPA Agent Markup Language (DAML) by extending RDF with more expressive constructs aimed at facilitating agent interaction on the Web.
Many disciplines now develop standardized ontologies that domain experts can use to share and annotate information in their fields. Medicine, for example, has produced large, standardized, structured vocabularies such as SNOMED [5] and the semantic network of the Unified Medical Language System. Broad general-purpose ontologies are emerging as well. For example, the United Nations Development Program and Dun & Bradstreet combined their efforts to develop the UNSPSC ontology, which provides terminology for products and services.
The Artificial-Intelligence literature contains many definitions of ontology, and many of these contradict one another. For the purposes of this guide, an ontology is a formal explicit description of concepts in a domain of discourse (classes, sometimes called concepts), of the properties of each concept describing various features and attributes of the concept (slots, sometimes called roles or properties), and of restrictions on slots (facets, sometimes called role restrictions). An ontology together with a set of individual instances of classes constitutes a knowledge base. In reality, there is a fine line where the ontology ends and the knowledge base begins.
2.1 Motivation for Ontology
Why would someone want to develop an ontology? Some of the reasons are:
- To share common understanding of the structure of information among people or software agents
- To enable reuse of domain knowledge
- To make domain assumptions explicit
- To separate domain knowledge from operational knowledge
- To analyze domain knowledge
2.2 Developing an Ontology
In practical terms, developing an ontology includes:
- defining classes in the ontology,
- arranging the classes in a taxonomic (subclass-superclass) hierarchy,
- defining slots and describing allowed values for these slots,
- filling in the values of slots for instances.
We can then create a knowledge base by defining individual instances of these classes, filling in specific slot-value information and additional slot restrictions.
2.3 Types of Ontology
Ontologies can be classified according to the degree of conceptualization. The types are:
- Top-level ontology
- Domain ontology
- Application ontology
A top-level ontology describes very common concepts which are independent of a particular problem or area. Top-level ontologies are relevant across domains and include vocabulary related to things, events, time, space, etc.
In domain ontologies the knowledge represented is specific to a particular domain, such as forestry or fishery. They provide vocabularies about concepts in a domain and their associations, or about the theories governing the domain. They consider the domain knowledge to which the ontology is applied.
Application (or task) ontologies describe knowledge pieces that depend on both a particular domain and a particular task; they are therefore related to problem-solving methods.
2.4 Rules for Ontology Design
There are some fundamental rules in ontology design to which we will refer many times while developing an ontology. These rules may seem rather inflexible; they can help, however, to make design decisions in many cases. The three fundamental rules are as follows.
Rule 1: There is no one correct approach to model a domain; there are always workable alternatives. The best solution almost always depends on the application that we have in mind and the extensions that we anticipate.
Rule 2: Ontology development is an iterative process.
Rule 3: Concepts in the ontology should be close to substances (physical or logical) and associations in our domain of concentration. These are most likely to be nouns (substances) or verbs (associations) in sentences that describe our domain.
That is, deciding what we are going to use the ontology for, and how detailed or general the ontology is going to be, will guide many of the modeling decisions.
3. CONCLUSION
Data mining plays an important role in many real-time areas such as the education, business, and government sectors. An effective data mining process means discovering knowledge from hidden data residing in large volumes of databases and data sets. An ontology is an abstract model which represents a common model for a domain; it takes an explicit specification of a conceptualization.
In this article, we present the basic concepts of ontology and survey existing research in data mining with ontology. An ontology, a formal explicit description of the concepts or classes in a domain of discourse, is the most important part of the knowledge. Among the areas of data mining, the problem of deriving knowledge from data has received a great deal of attention. This paper describes the topics of ontological engineering, the types of ontology, and rules for ontology design.
4. REFERENCES
[1]. Ball C.A and Brazma A, MGED standards: work in progress. Omics: A Journal of Integrative Biology, 10(2):138-144, 2006.
[2]. Cao L and Zhang C, Domain-driven data mining: A practical methodology. IJDWM, vol. 2, no. 4, pp. 49-65, 2006.
[3]. Chandrasekaran B, Josephson J.R, and Benjamins V.R, What are ontologies, and why do we need them? IEEE Intelligent Systems, 14(1):20-26, 1999.
[4]. Cromp R.T and Campbell W.J, Data Mining of Multi-dimensional Remotely Sensed Images. In Proc. of the Int. Conference on Information and Knowledge Management, 1993.
[5]. Faure D and Poibeau T, First Experiences of Using Semantic Knowledge Learned by ASIUM for an Information Extraction Task Using INTEX. Proc. Workshop on Ontology Learning, 2000.
[6]. Heath T and Bizer C, Linked Data: Evolving the Web into a Global Data Space (1st edition). Synthesis Lectures on the Semantic Web: Theory and Technology, 1:1, 1-136. Morgan & Claypool, 2011.
[7]. Smith B and Shah N, Ontologies for Biomedicine: How to Make Them and Use Them. Tutorial notes at ISMB/ECCB 2007.
[8]. Yang Q and Wu X, 10 challenging problems in data mining research. International Journal of Information Technology and Decision Making, 5(4):597-604, 2006.
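As a concrete illustration of the ontology-development steps described in Section 2.2, the following minimal Python sketch models classes arranged in a subclass-superclass taxonomy, slots restricted by facets (allowed values), and instances that together with the classes form a knowledge base. All names here (OntologyClass, KnowledgeBase, the forestry example) are illustrative assumptions, not constructs from the paper or from any ontology tool.

```python
# Minimal sketch of Section 2.2: define classes, arrange them in a taxonomy,
# define slots with facets, and fill in slot values for instances.

class OntologyClass:
    def __init__(self, name, superclass=None):
        self.name = name
        self.superclass = superclass  # taxonomic (subclass-superclass) link
        self.slots = {}               # slot name -> facet (set of allowed values)

    def define_slot(self, slot, allowed_values):
        # A facet restricts the values a slot may take.
        self.slots[slot] = set(allowed_values)

    def allowed(self, slot, value):
        # Facets are inherited up the taxonomy.
        cls = self
        while cls is not None:
            if slot in cls.slots:
                return value in cls.slots[slot]
            cls = cls.superclass
        return False


class KnowledgeBase:
    """An ontology (classes) plus individual instances of those classes."""

    def __init__(self):
        self.instances = []

    def add_instance(self, cls, name, **slot_values):
        # Reject slot values that violate the class's (inherited) facets.
        for slot, value in slot_values.items():
            if not cls.allowed(slot, value):
                raise ValueError(f"facet violation: {slot}={value!r} on {cls.name}")
        self.instances.append((cls.name, name, slot_values))


# Usage: a tiny domain ontology for forestry (an example domain from Section 2.3).
tree = OntologyClass("Tree")
tree.define_slot("leaf_type", {"needle", "broad"})
conifer = OntologyClass("Conifer", superclass=tree)  # subclass of Tree

kb = KnowledgeBase()
kb.add_instance(conifer, "pine-1", leaf_type="needle")
print(len(kb.instances))  # prints 1
```

In practice such a model would be expressed in an ontology language such as RDF or OWL rather than ad hoc classes; the sketch only makes the class/slot/facet/instance vocabulary of the paper concrete.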
CALL FOR PAPERS
International Journal of Computer Science and Information Security
IJCSIS 2015
ISSN: 1947-5500
http://sites.google.com/site/ijcsis/
The International Journal of Computer Science and Information Security (IJCSIS) is the premier
scholarly venue in the areas of computer science and security issues. IJCSIS 2015 will provide a high-
profile, leading-edge platform for researchers and engineers alike to publish state-of-the-art research in the
respective fields of information technology and communication security. The journal will feature a diverse
mixture of publication articles including core and applied computer science related topics.
Authors are solicited to contribute to the special issue by submitting articles that illustrate research results,
projects, survey works, and industrial experiences that describe significant advances in, but not limited to,
the following areas. Submissions may span a broad range of topics, e.g.:
Track A: Security
Access control, Anonymity, Audit and audit reduction & Authentication and authorization, Applied
cryptography, Cryptanalysis, Digital Signatures, Biometric security, Boundary control devices,
Certification and accreditation, Cross-layer design for security, Security & Network Management, Data and
system integrity, Database security, Defensive information warfare, Denial of service protection, Intrusion
Detection, Anti-malware, Distributed systems security, Electronic commerce, E-mail security, Spam,
Phishing, E-mail fraud, Virus, worms, Trojan Protection, Grid security, Information hiding and
watermarking & Information survivability, Insider threat protection, Integrity
Intellectual property protection, Internet/Intranet Security, Key management and key recovery, Language-
based security, Mobile and wireless security, Mobile, Ad Hoc and Sensor Network Security, Monitoring
and surveillance, Multimedia security, Operating system security, Peer-to-peer security, Performance
Evaluations of Protocols & Security Application, Privacy and data protection, Product evaluation criteria
and compliance, Risk evaluation and security certification, Risk/vulnerability assessment, Security &
Network Management, Security Models & protocols, Security threats & countermeasures (DDoS, MiM,
Session Hijacking, Replay attack, etc.), Trusted computing, Ubiquitous Computing Security, Virtualization
security, VoIP security, Web 2.0 security, Submission Procedures, Active Defense Systems, Adaptive
Defense Systems, Benchmark, Analysis and Evaluation of Security Systems, Distributed Access Control
and Trust Management, Distributed Attack Systems and Mechanisms, Distributed Intrusion
Detection/Prevention Systems, Denial-of-Service Attacks and Countermeasures, High Performance
Security Systems, Identity Management and Authentication, Implementation, Deployment and
Management of Security Systems, Intelligent Defense Systems, Internet and Network Forensics, Large-
scale Attacks and Defense, RFID Security and Privacy, Security Architectures in Distributed Network
Systems, Security for Critical Infrastructures, Security for P2P systems and Grid Systems, Security in E-
Commerce, Security and Privacy in Wireless Networks, Secure Mobile Agents and Mobile Code, Security
Protocols, Security Simulation and Tools, Security Theory and Tools, Standards and Assurance Methods,
Trusted Computing, Viruses, Worms, and Other Malicious Code, World Wide Web Security, Novel and
emerging secure architecture, Study of attack strategies, attack modeling, Case studies and analysis of
actual attacks, Continuity of Operations during an attack, Key management, Trust management, Intrusion
detection techniques, Intrusion response, alarm management, and correlation analysis, Study of tradeoffs
between security and system performance, Intrusion tolerance systems, Secure protocols, Security in
wireless networks (e.g. mesh networks, sensor networks, etc.), Cryptography and Secure Communications,
Computer Forensics, Recovery and Healing, Security Visualization, Formal Methods in Security, Principles
for Designing a Secure Computing System, Autonomic Security, Internet Security, Security in Health Care
Systems, Security Solutions Using Reconfigurable Computing, Adaptive and Intelligent Defense Systems,
Authentication and Access control, Denial of service attacks and countermeasures, Identity, Route and
Location Anonymity schemes, Intrusion detection and prevention techniques, Cryptography, encryption
algorithms and Key management schemes, Secure routing schemes, Secure neighbor discovery and
localization, Trust establishment and maintenance, Confidentiality and data integrity, Security architectures,
deployments and solutions, Emerging threats to cloud-based services, Security model for new services,
Cloud-aware web service security, Information hiding in Cloud Computing, Securing distributed data
storage in cloud, Security, privacy and trust in mobile computing systems and applications, Middleware
security & Security features: middleware software is an asset on
its own and has to be protected, interaction between security-specific and other middleware features, e.g.,
context-awareness, Middleware-level security monitoring and measurement: metrics and mechanisms
for quantification and evaluation of security enforced by the middleware, Security co-design: trade-off and
co-design between application-based and middleware-based security, Policy-based management:
innovative support for policy-based definition and enforcement of security concerns, Identification and
authentication mechanisms: Means to capture application specific constraints in defining and enforcing
access control rules, Middleware-oriented security patterns: identification of patterns for sound, reusable
security, Security in aspect-based middleware: mechanisms for isolating and enforcing security aspects,
Security in agent-based platforms: protection for mobile code and platforms, Smart Devices: Biometrics,
National ID cards, Embedded Systems Security and TPMs, RFID Systems Security, Smart Card Security,
Pervasive Systems: Digital Rights Management (DRM) in pervasive environments, Intrusion Detection and
Information Filtering, Localization Systems Security (Tracking of People and Goods), Mobile Commerce
Security, Privacy Enhancing Technologies, Security Protocols (for Identification and Authentication,
Confidentiality and Privacy, and Integrity), Ubiquitous Networks: Ad Hoc Networks Security, Delay-
Tolerant Network Security, Domestic Network Security, Peer-to-Peer Networks Security, Security Issues
in Mobile and Ubiquitous Networks, Security of GSM/GPRS/UMTS Systems, Sensor Networks Security,
Vehicular Network Security, Wireless Communication Security: Bluetooth, NFC, WiFi, WiMAX,
WiMedia, others
This Track will emphasize the design, implementation, management and applications of computer
communications, networks and services. Topics of mostly theoretical nature are also welcome, provided
there is clear practical potential in applying the results of such work.
Broadband wireless technologies: LTE, WiMAX, WiRAN, HSDPA, HSUPA, Resource allocation and
interference management, Quality of service and scheduling methods, Capacity planning and dimensioning,
Cross-layer design and Physical layer based issue, Interworking architecture and interoperability, Relay
assisted and cooperative communications, Location provisioning and mobility management, Call
admission and flow/congestion control, Performance optimization, Channel capacity modeling and analysis,
Middleware Issues: Event-based, publish/subscribe, and message-oriented middleware, Reconfigurable,
adaptable, and reflective middleware approaches, Middleware solutions for reliability, fault tolerance, and
quality-of-service, Scalability of middleware, Context-aware middleware, Autonomic and self-managing
middleware, Evaluation techniques for middleware solutions, Formal methods and tools for designing,
verifying, and evaluating, middleware, Software engineering techniques for middleware, Service oriented
middleware, Agent-based middleware, Security middleware, Network Applications: Network-based
automation, Cloud applications, Ubiquitous and pervasive applications, Collaborative applications, RFID
and sensor network applications, Mobile applications, Smart home applications, Infrastructure monitoring
and control applications, Remote health monitoring, GPS and location-based applications, Networked
vehicles applications, Alert applications, Embedded Computer Systems, Advanced Control Systems, and
Intelligent Control : Advanced control and measurement, computer and microprocessor-based control,
signal processing, estimation and identification techniques, application specific ICs, nonlinear and
adaptive control, optimal and robust control, intelligent control, evolutionary computing, and intelligent
systems, instrumentation subject to critical conditions, automotive, marine and aero-space control and all
other control applications, Intelligent Control System, Wired/Wireless Sensors, Signal Control System.
Sensors, Actuators and Systems Integration : Intelligent sensors and actuators, multisensor fusion, sensor
array and multi-channel processing, micro/nano technology, microsensors and microactuators,
instrumentation electronics, MEMS and system integration, wireless sensor, Network Sensor, Hybrid
Sensor, Distributed Sensor Networks. Signal and Image Processing : Digital signal processing theory,
methods, DSP implementation, speech processing, image and multidimensional signal processing, Image
analysis and processing, Image and Multimedia applications, Real-time multimedia signal processing,
Computer vision, Emerging signal processing areas, Remote Sensing, Signal processing in education.
Industrial Informatics: Industrial applications of neural networks, fuzzy algorithms, Neuro-Fuzzy
application, bioInformatics, real-time computer control, real-time information systems, human-machine
interfaces, CAD/CAM/CAT/CIM, virtual reality, industrial communications, flexible manufacturing
systems, industrial automated process, Data Storage Management, Hard disk control, Supply Chain
Management, Logistics applications, Power plant automation, Drives automation. Information Technology,
Management of Information System : Management information systems, Information Management,
Nursing information management, Information System, Information Technology and their application, Data
retrieval, Data Base Management, Decision analysis methods, Information processing, Operations research,
E-Business, E-Commerce, E-Government, Computer Business, Security and risk management, Medical
imaging, Biotechnology, Bio-Medicine, Computer-based information systems in health care, Changing
Access to Patient Information, Healthcare Management Information Technology.
Communication/Computer Network, Transportation Application : On-board diagnostics, Active safety
systems, Communication systems, Wireless technology, Communication application, Navigation and
Guidance, Vision-based applications, Speech interface, Sensor fusion, Networking theory and technologies,
Transportation information, Autonomous vehicle, Vehicle application of affective computing, Advance
Computing technology and their application : Broadband and intelligent networks, Data Mining, Data
fusion, Computational intelligence, Information and data security, Information indexing and retrieval,
Information processing, Information systems and applications, Internet applications and performances,
Knowledge based systems, Knowledge management, Software Engineering, Decision making, Mobile
networks and services, Network management and services, Neural Network, Fuzzy logics, Neuro-Fuzzy,
Expert approaches, Innovation Technology and Management : Innovation and product development,
Emerging advances in business and its applications, Creativity in Internet management and retailing, B2B
and B2C management, Electronic transceiver device for Retail Marketing Industries, Facilities planning
and management, Innovative pervasive computing applications, Programming paradigms for pervasive
systems, Software evolution and maintenance in pervasive systems, Middleware services and agent
technologies, Adaptive, autonomic and context-aware computing, Mobile/Wireless computing systems and
services in pervasive computing, Energy-efficient and green pervasive computing, Communication
architectures for pervasive computing, Ad hoc networks for pervasive communications, Pervasive
opportunistic communications and applications, Enabling technologies for pervasive systems (e.g., wireless
BAN, PAN), Positioning and tracking technologies, Sensors and RFID in pervasive systems, Multimodal
sensing and context for pervasive applications, Pervasive sensing, perception and semantic interpretation,
Smart devices and intelligent environments, Trust, security and privacy issues in pervasive systems, User
interfaces and interaction models, Virtual immersive communications, Wearable computers, Standards and
interfaces for pervasive computing environments, Social and economic models for pervasive systems,
Active and Programmable Networks, Ad Hoc & Sensor Network, Congestion and/or Flow Control, Content
Distribution, Grid Networking, High-speed Network Architectures, Internet Services and Applications,
Optical Networks, Mobile and Wireless Networks, Network Modeling and Simulation, Multicast,
Multimedia Communications, Network Control and Management, Network Protocols, Network
Performance, Network Measurement, Peer to Peer and Overlay Networks, Quality of Service and Quality
of Experience, Ubiquitous Networks, Crosscutting Themes Internet Technologies, Infrastructure,
Services and Applications; Open Source Tools, Open Models and Architectures; Security, Privacy and
Trust; Navigation Systems, Location Based Services; Social Networks and Online Communities; ICT
Convergence, Digital Economy and Digital Divide, Neural Networks, Pattern Recognition, Computer
Vision, Advanced Computing Architectures and New Programming Models, Visualization and Virtual
Reality as Applied to Computational Science, Computer Architecture and Embedded Systems, Technology
in Education, Theoretical Computer Science, Computing Ethics, Computing Practices & Applications
Authors are invited to submit papers through e-mail ijcsiseditor@gmail.com. Submissions must be original
and should not have been published previously or be under consideration for publication while being
evaluated by IJCSIS. Before submission authors should carefully read over the journal's Author Guidelines,
which are located at http://sites.google.com/site/ijcsis/authors-notes .
IJCSIS PUBLICATION 2015
ISSN 1947 5500
http://sites.google.com/site/ijcsis/