
IJCSIS Vol. 13, July 2015 Special Issue


ISSN 1947-5500

International Journal of
Computer Science
& Information Security

IJCSIS PUBLICATION 2015


Pennsylvania, USA
IJCSIS
ISSN (online): 1947-5500

Please consider contributing to, and/or forwarding to the appropriate groups, the following opportunity to submit and publish
original scientific results.

CALL FOR PAPERS


International Journal of Computer Science and Information Security (IJCSIS)
January-December 2015 Issues

The topics suggested by this issue can be discussed in terms of concepts, surveys, state of the art, research,
standards, implementations, running experiments, applications, and industrial case studies. Authors are invited
to submit complete unpublished papers, which are not under review in any other conference or journal, in the
following (but not limited to) topic areas.
See authors guide for manuscript preparation and submission guidelines.

Indexed by Google Scholar, DBLP, CiteSeerX, Directory of Open Access Journals (DOAJ), Bielefeld
Academic Search Engine (BASE), SCIRUS, Scopus Database, Cornell University Library, ScientificCommons,
ProQuest, EBSCO and more.
Deadline: see web site
Notification: see web site
Revision: see web site
Publication: see web site

Context-aware systems
Agent-based systems
Networking technologies
Mobility and multimedia systems
Security in networks, systems, and applications
Systems performance
Evolutionary computation
Networking and telecommunications
Industrial systems
Software development and deployment
Knowledge virtualization
Autonomic and autonomous systems
Systems and networks on the chip
Bio-technologies
Knowledge for global defense
Knowledge data systems
Information Systems (IS)
Mobile and distance education
IPv6 Today - Technology and deployment
Intelligent techniques, logics and systems
Modeling
Knowledge processing
Software Engineering
Information technologies
Optimization
Internet and web technologies
Complexity
Digital information processing
Natural Language Processing
Cognitive science and knowledge
Speech Synthesis
Data Mining

For more topics, please see web site https://sites.google.com/site/ijcsis/

For more information, please visit the journal website (https://sites.google.com/site/ijcsis/)

IJCSIS EDITORIAL BOARD


Professor Yong Li, PhD.
School of Electronic and Information Engineering, Beijing Jiaotong University,
P. R. China

Professor Ying Yang, PhD.


Computer Science Department, Yale University, USA

Professor Hamid Reza Naji, PhD.


Department of Computer Engineering, Shahid Beheshti University, Tehran, Iran

Professor Elboukhari Mohamed, PhD.


Department of Computer Science, University Mohammed First, Oujda, Morocco

Professor Mokhtar Beldjehem, PhD.


Sainte-Anne University, Halifax, NS, Canada

Professor Yousef Farhaoui, PhD.


Department of Computer Science, Moulay Ismail University, Morocco

Dr. Alex Pappachen James


Queensland Micro-nanotechnology center, Griffith University, Australia

Dr. Sanjay Jasola


Professor and Dean, School of Information and Communication Technology,
Gautam Buddha University

Dr Riktesh Srivastava
Assistant Professor, Information Systems, Skyline University College, University
City of Sharjah, Sharjah, PO 1797, UAE

Dr. Siddhivinayak Kulkarni


University of Ballarat, Ballarat, Victoria, Australia

Dr. T. C. Manjunath
HKBK College of Engg., Bangalore, India

Dr. Naseer Alquraishi


University of Wasit, Iraq

Dr. Shimon K. Modi


Director of Research BSPA Labs, Purdue University, USA

Dr. Jianguo Ding


Norwegian University of Science and Technology (NTNU), Norway

Dr. Jorge A. Ruiz-Vanoye


Universidad Autónoma del Estado de Morelos, Mexico

Prof. Ning Xu
Wuhan University of Technology, China
Dr. Bilal Alatas
Department of Software Engineering, Firat University, Turkey

Dr. Ioannis V. Koskosas


University of Western Macedonia, Greece

Dr Venu Kuthadi
University of Johannesburg, Johannesburg, RSA

Dr. Kai Cong


Intel Corporation, & Computer Science Department, Portland State University, USA

Dr. Omar A. Alzubi


Prince Abdullah Bin Ghazi Faculty of Information Technology
Al-Balqa Applied University (BAU), Jordan
Preface
This book includes the proceedings of the National Conference on Research Issues in Image Analysis &
Mining Intelligence (NCRIIAMI-2015), organized by Erode Arts & Science College, Erode. The
proceedings are a set of rigorously reviewed manuscripts presenting the state of practice in innovative
algorithms and techniques in Image Analysis and Mining Intelligence. Each paper received at least two
reviews, and authors were required to address the review comments prior to presentation and publication.

The technical co-sponsorship provided by international agencies and institutions (Google Scholar,
CiteSeerX, Cornell University Library, Ei Compendex, Scopus, DBLP, DOAJ, ProQuest and EBSCO) is
gratefully appreciated. The excellent contributions of the authors made this volume possible and, in
particular, we want to acknowledge the contributions of the following researchers: S. Antoinette Aroul
Jeyanthi, Dr. S. Pannirselvam, P. Rajeswari, Dr. T. N. Ravi, S. Prasath, A. Vinayagam, Dr. C. Kavitha, Dr.
K. Thangadurai, Dr. P. Raajan, K. Aruna Devi, Lydia Packiam Mettilda, D. Rajakumari, Yogalakshmi. S,
Venkatesh Kumar. R, Lawrance. R, S. Ponmani, R. Sridevi, R. Rahinipriyadharshini, Dr. S. K. Jayanthi, V.
Subhashini, D. Jemimah Sweetlin, J. Jebakumari Beulah Vasanthi, R. Sankarasubramanian, Dr. S.
Sukumaran, Dr. G. Anandharaj, D. Roselin Selvarani, J. Daniel Mano, Dr. S. Sathappan, Prof. T.
Ranganayaki, Prof. Dr. M. Venkatachalam, M. Anandhi, Dr. Antony Selvadoss Thanamani, K. Priyanka,
R. Keerthana, M. Suguna, Dr. K. Meenakshi Sundaram, C. Veerakumar, M. Menaka, R. N. Muhammad
Ilyas, M. Parameswari, R. Pushpalatha, R. Rajamani, M. Rathika, K. Ramya, G. Vijaiprabhu. Particular
thanks go to S. Prasath for his professionalism and support as a reviewer for this special issue.

The conference organizers and IJCSIS editorial board are confident that you will find the papers included
in this volume interesting and useful.

We support researchers by providing high visibility and impact, prestige, and an efficient
publication process and service.

For further questions please do not hesitate to contact us at ijcsiseditor@gmail.com.

A complete list of journals can be found at:


http://sites.google.com/site/ijcsis/

IJCSIS Vol. 13, July 2015 Special Issue


ISSN 1947-5500 IJCSIS, USA.

Journal Indexed by (among others):


Bibliographic Information
ISSN: 1947-5500
Monthly publication (Regular and Special Issues)
Commenced publication in May 2009

Editorial / Paper Submissions:


IJCSIS Managing Editor
(ijcsiseditor@gmail.com)
Pennsylvania, USA
Tel: +1 412 390 5159
Acknowledgments

The International Journal of Computer Science and Information Security (IJCSIS) editorial board would like to
thank the committee members of the National Conference on Research Issues in Image Analysis &
Mining Intelligence for the help they provided in the realization of these proceedings and in paper selection.
Particular thanks go to Dr. R. Shanmugasundaram, Dr. K. Meenakshisundaram, Dr. S. Sukumaran, Prof.
C. Senthilkumar, Prof. M. Parameswari, Prof. T. Ranganayaki and Prof. R. Sankarasubramanian. A final
thanks goes to S. Prasath, IJCSIS Reviewer, for his diligent work to compile and review the selected
papers.
Table of Contents

Preface

1. Dynamic Rule Filtering Technique using User Beliefs ..... 1
   S. Antoinette Aroul Jeyanthi, Dr. S. Pannirselvam
2. A Critique on Quality of Service in MANET ..... 5
   P. Rajeswari, Dr. T. N. Ravi
3. A Hybrid Model for Face Recognition and Retrieval with Multi-scale Features using Auto Correlation Function ..... 10
   S. Prasath, Dr. S. Pannirselvam
4. An Analysis of Air Pollutant (SO2) using Particle Swarm Optimization (PSO) ..... 19
   A. Vinayagam, Dr. C. Kavitha, Dr. K. Thangadurai
5. An Efficient Analysis on Semantic Approaches in Various Fields ..... 23
   Dr. P. Raajan, K. Aruna Devi, Lydia Packiam Mettilda
6. An Efficient Classification using Machine Learning Approach on Breast Cancer Datasets ..... 26
   D. Rajakumari
7. Analysis on SNP Tools ..... 31
   Yogalakshmi. S, Venkatesh Kumar. R, Lawrance. R
8. Preprocessing of Tamil Handwritten Character Image using Various Filters ..... 35
   S. Ponmani, Dr. S. Pannirselvam
9. Protection of Private Data over the Transmission through Electronic Gadgets using RSA Algorithm ..... 40
   R. Sridevi, R. Rahinipriyadharshini
10. Review of Techniques Compromising Search Engines: Web Spam ..... 44
    Dr. S. K. Jayanthi, V. Subhashini
11. An Efficient Analysis on Feature Selection to Evaluate the Students' Performance based on Various Methods ..... 48
    D. Jemimah Sweetlin, J. Jebakumari Beulah Vasanthi, Dr. P. Raajan
12. Improving Performance of Energy Efficient Zone Technique using Location Aided Routing Protocol for MANET ..... 52
    R. Sankarasubramanian, Dr. S. Sukumaran, Dr. G. Anandharaj
13. Problems Encountered in Securing Mobile Database and their Solutions ..... 58
    D. Roselin Selvarani, Dr. T. N. Ravi
14. An Enhanced Encryption Technique for Wireless Sensor Networks ..... 63
    J. Daniel Mano, Dr. S. Sathappan
15. Investigation on Cyber Crime, Cyber Law and Cyber Security ..... 68
    Prof. T. Ranganayaki, Prof. Dr. M. Venkatachalam
16. Enhanced Detection and Prevention of DDoS Attacks using Packet Filtering Technique ..... 72
    M. Sivakumar, C. Senthilkumar
17. Scalable, Power Aware and Efficient Cluster for Secured MANET ..... 79
    M. Anandhi, Prof. Dr. T. N. Ravi
18. Analysis of Limitation in Cryptography Techniques and Application ..... 83
    Dr. Antony Selvadoss Thanamani, K. Priyanka
19. A Novel Scheme for Handwritten Document using Image Segmentation ..... 87
    Dr. S. Pannirselvam, S. Prasath, R. Keerthana
20. A Selective Analysis of Imputation Methods for Collaborative Filtering System ..... 90
    M. Suguna, Dr. P. Raajan
21. Discretization in Mining using Binning Method ..... 93
    Dr. K. Meenakshi Sundaram, C. Veerakumar, M. Menaka
22. A Content based Image Retrieval with Shape Feature Extraction ..... 96
    Dr. S. Pannirselvam, R. N. Muhammad Ilyas
23. An Efficient Ad Hoc Network for Theft Prevention under Honest Node Detection ..... 100
    M. Parameswari, Dr. S. Sukumaran
24. A Study on Various Approaches handled for Clustering ..... 104
    R. Pushpalatha, Dr. K. Meenakshi Sundaram
25. A Study on Various Classification Techniques for Detecting Liver Cancer in Human ..... 110
    R. Rajamani, M. Rathika
26. A Study on Ontological Engineering ..... 115
    Dr. K. Meenakshi Sundaram, K. Ramya, G. Vijaiprabhu
(IJCSIS) International Journal of Computer Science and Information Security
National Conference on Research Issues in Image Analysis & Mining Intelligence (NCRIIAMI-2015)

Dynamic Rule Filtering Technique using User Beliefs


S. Antoinette Aroul Jeyanthi
Associate Professor of Computer Science,
Pope John Paul II College of Education,
Pondicherry, India
jayanthijames@yahoo.com

Dr. S. Pannirselvam
Associate Professor of Computer Science,
Erode Arts & Science College (Autonomous),
Erode, India
pannirselvam08@gmail.com

ABSTRACT
One of the major problems in the field of actionable knowledge discovery (or data mining) is the interestingness problem. Past research and applications have found that, in practice, it is all too easy to discover a huge number of patterns in a database. Most of these patterns are actually useless or uninteresting to the user. But due to the huge number of patterns, it is difficult for the user to comprehend them and to identify those interesting to him/her. In this paper, we propose and implement a dynamic rule filtering technique to filter the uninteresting patterns based on the user's beliefs. In this technique, objective measures are used as a first filter to remove rules that are definitely uninteresting. Subjective measures are then used as a second filter that brings beliefs into the interestingness evaluation.

General Terms
Data mining, Objective/Subjective measures, Filtering, Rule set, Knowledge discovery.

Keywords
Rule filtering, User beliefs, Uninteresting patterns, Actionable knowledge discovery, Association rules.

1. INTRODUCTION
In data mining, association rule learning is a popular and well researched method for discovering interesting relations between variables in large databases. In recent years the sizes of databases have increased rapidly. This has led to a growing interest in the development of tools capable of the automatic discovery of implicit information or knowledge within databases. The implicit information within databases, and mainly the interesting association relationships among sets of objects that lead to association rules, may disclose useful patterns for decision support, financial forecasting, marketing policies, medical diagnosis and many other applications. Mining association rules may require iterative scanning of large databases, which is costly in processing. The number of generated patterns is so large that manual inspection and analysis is impractical if not impossible. To prevent the user from being overwhelmed by the large number of patterns, techniques are needed to filter uninteresting patterns according to the user's interests.

In this paper we consider the problem of filtering uninteresting/irrelevant rules from the set of all association rules holding in a data set. This is a generic problem in data mining: while formal statistical criteria for rule strength and (statistical) significance abound, it is much harder to know which of the discovered rules are really uninteresting to the user. Of course, this problem is quite hard. The issue of interestingness of discovered knowledge has been discussed in general by Piatetsky-Shapiro [1]. Hoschka and Klösgen [2] have also used templates for defining interesting knowledge, and their ideas have strongly influenced our work. We show that it can be helpful to ask the user for some simple additional information about the structure of the data. The basic idea is to apply templates, a form of pattern expressions, in information retrieval from the set of discovered rules. Templates can be used to describe the form of interesting rules, and also to specify which rules are not interesting.

This paper proposes a new dynamic interactive rule filtering technique using the user's beliefs to prune and filter discovered rules. A small illustration of template-based filtering is given below.
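To make the template idea concrete, the following is a minimal sketch (not the authors' implementation) of template-based rule filtering, with rules and templates encoded as wildcard patterns; the rule encoding, the sample rules and the `matches` helper are illustrative assumptions.

```python
import fnmatch

# A discovered association rule: antecedent -> consequent,
# each side represented as an attribute=value string.
rules = [
    ("age=0-10", "test=TRT"),
    ("sex=male", "test=pelvic_area"),
    ("symptom=fever", "disease=flu"),
]

# Templates describing UNINTERESTING rule forms (wildcards allowed),
# e.g. any rule that recommends any test for the 0-10 age group.
uninteresting_templates = [
    ("age=0-10", "test=*"),
    ("sex=male", "test=pelvic*"),
]

def matches(rule, template):
    """True if every part of the rule matches the corresponding template pattern."""
    return all(fnmatch.fnmatch(part, pat) for part, pat in zip(rule, template))

filtered = [r for r in rules
            if not any(matches(r, t) for t in uninteresting_templates)]
print(filtered)  # -> [('symptom=fever', 'disease=flu')]
```

Templates thus act as a compact, user-editable description of entire families of rules, which is what makes the elimination step fast.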
The paper is organized in six sections. After this introduction, Section 2 presents the related work. Section 3 introduces the concept of interestingness measures and their usage. In Section 4, the proposed filtering technique is explained. Section 5 presents the experimental evaluation. Section 6 is a short conclusion.

2. RELATED WORK
Previous research has looked into actionable knowledge discovery. For example, in discovering actionable knowledge for customer relationship management (CRM), Liu et al. proposed an approach to suggest actions to reclassify a customer from an undesired status to a desired one while post-processing decision trees to maximize expected net profit [3]. However, these methods could miss some actions with higher net profit. To handle this problem, multiple trees with different subsets of hard attributes need to be built. To get optimal actions, the number of trees could be very large, especially when there are many hard attributes.

In order to increase the profit of an institute and to help devise a direct-marketing plan, Yang et al. proposed a lazy approach that uses role models to generate advice and plans [4]. However, this method fails to provide rules in advance and incurs high computational cost when generating action recommendations.

To improve the profitability of a bank, Schrodt proposed action rules constructed from certain pairs of classification rules [5]. Ras and Tzacheva defined interesting action rules as the rules of the lowest cost [6].

Ras et al. proposed a heuristic strategy to construct interesting action rules [7]. Wang et al. combined the action forest algorithm to extract action rules and proposed a heuristic strategy to generate interesting action rules [8].

Tzacheva and Tsay introduced the notions of cost and feasibility of an action rule and proposed a graph based method to search for and construct feasible action rules at the lowest cost [9]. Previous studies on mining action rules all produced actionable rules based on a certain pair of classification rules or a single classification rule. A main drawback of these approaches is that some interesting actionable rules can be missed.


To address this problem, He et al. proposed another strategy, in a support-confidence-cost framework, to discover action rules directly from a database [10]. Ras et al. proposed an approach to generate association-type action rules [11]. Subsequently, Ras and Dardzinska proposed a bottom-up strategy to discover action rules without using pre-existing classification rules [12].

3. INTERESTINGNESS MEASURES
In recent years, a lot of work has been done in defining and quantifying interestingness. As a result, several measures that view interestingness from different perspectives have been proposed, developed and applied. Interestingness measures attempt to capture the amount of interest any pattern is expected to evoke on inspection. Merriam-Webster's Collegiate Dictionary defines interest as "a feeling that accompanies or causes special attention to an object or class of objects", or "something that arouses such attention". Interesting patterns are supposed to arouse strong attention from users. A user might find a pattern interesting for various reasons, some of which may be difficult to articulate. It has been found that interestingness is an elusive concept [13] consisting of many facets that are difficult to operationalize and therefore difficult to capture. In some cases, a particular behavior in a domain might be interesting, while the same behavior exhibited in another domain may not be. Thus, interestingness may be domain- and user-dependent. In other cases the same features may be domain- and user-independent. Capturing all facets of interestingness in one single measure simultaneously is an arduous if not impossible task.

An important and useful classification scheme for interestingness measures may be based on user involvement. This results in two categories: objective and subjective measures [14][15]. Objective measures usually deal with data-related aspects such as the data distribution, the structure of the rule and others, while subjective measures are more user-driven.

3.1 Objective Measures of Interestingness
Objective measures quantify a pattern's interestingness in terms of the pattern's structure and the underlying data used in the discovery process. Typical objective measures of interestingness include statistical measures like confidence, support, lift, conviction, rule interest and collective strength. Objective measures are strongly domain- and user-independent. They reveal data characteristics that are not tied to domain- or user-specific definitions. This increases their applicability in different situations. However, this property of independence may limit their power of discrimination. Since any objective measure has to hold across all domains, it considers interestingness from a limited perspective that is common across domains. Hence, objective measures cannot capture all complexities of the discovery process [14]. Many objective measures are based on the strength of dependence relationships between items [16][17], and interestingness is regarded as being directly proportional to this strength.

Objective measures have their own limitations and biases. They are used as initial filters to remove definitely uninteresting or unprofitable rules. Rules that are statistically insignificant may be removed since they do not warrant further attention.
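As a concrete illustration of these objective measures, the sketch below computes support, confidence and lift for a candidate rule over a transaction list; the tiny transaction set and helper names are illustrative assumptions, not data from the paper.

```python
# Objective measures for an association rule A -> B over a set of transactions.
transactions = [
    {"fever", "flu", "blood_test"},
    {"fever", "blood_test"},
    {"cough", "xray"},
    {"fever", "flu"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    return confidence(antecedent, consequent) / support(consequent)

A, B = {"fever"}, {"flu"}
print(support(A | B))    # 0.5
print(confidence(A, B))  # 0.666...
print(lift(A, B))        # 1.333... (> 1: A and B occur together more than by chance)
```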
3.2 Subjective Measures of Interestingness
Domain experts (users) play an important role in the interpretation and subsequent application of data mining results. Hence the need to incorporate user views in addition to data-related aspects. Generally, users differ in their beliefs and interests since they possess varied experience, knowledge and psychological make-up. They may be interested in different aspects of the domain depending on their area of work. In addition, they may also have varying goals and differences of opinion about the applicability and usefulness of KDD results. This variation in interest enhances the importance of injecting subjectivity into the interestingness evaluation. Actionability and unexpectedness are two facets that determine subjective interestingness [14]. Another facet of interestingness pertains to the prior knowledge of the user: prior knowledge increases the interestingness of a subject.

Interestingness measures can play an important role in the identification of novel, relevant, implicit and understandable patterns from the multitude of mined patterns. They help in automating most of the post-processing work in data mining. Applying interestingness measures during the various phases of data mining has its own advantages and disadvantages. If the dataset is large, it may be advantageous to mine the rules first and then apply interestingness measures. On the other hand, for a small dataset, it may be preferable to apply interestingness measures during the mining phase itself.

4. PROPOSED TECHNIQUE
In this paper, we propose and implement a dynamic rule filtering technique to filter the uninteresting patterns based on the user's beliefs. In this technique, objective measures are used as a first filter to remove rules that are definitely uninteresting. Subjective measures are then used as a second filter that brings in user beliefs.

In the proposed technique, we identify interesting rules indirectly by the elimination of uninteresting rules. The effectiveness of this approach lies in its ability to quickly eliminate large families of rules that are not interesting. Fig. 1 shows the proposed framework: the database (DB) feeds association rule mining, whose output passes through the rule filtering technique, guided by user beliefs stored as rule sets, to produce the filtered rules for post-processing.

Fig 1: Proposed Framework

First, basic association mining is applied over the data and, from the discovered rules, the definitely uninteresting ones are filtered by setting objective measures such as a support threshold and a confidence threshold. On the other hand, not all rules with high confidence and support are interesting. Rules can fail to be interesting for several reasons:
- A rule can correspond to prior knowledge or expectation.
- A rule can refer to uninteresting attributes or attribute combinations.
- Rules can be redundant.

Hence, we use user beliefs as a second filter. From the rules passing the first filter, we remove the rules that are irrelevant according to the user beliefs. The user beliefs can be implemented as a rule set, and rules can be added to or removed from this set dynamically. The discovered rules are compared with the rule set, which is used to predict and prune whether a particular pattern is interesting or not.

4.1 Dynamic Rule Filtering Technique
The proposed Dynamic Rule Filtering Technique is implemented with the following algorithm (a runnable sketch follows the listing).

INPUT:  MDB - Master Database
        RS  - Rule Set consisting of uninteresting rules
        (S) - Symptoms
OUTPUT: Filtered Rule Set, FRS

STEP 1: Activate the Master Database, MDB, and search for the symptoms {s1, s2, ...}.
STEP 2: Apply a basic mining algorithm to find all frequent patterns which satisfy the minimum support and minimum confidence thresholds over DB (the subset of MDB after extraction), using the following steps:
  (a) Scan the database to calculate the support of each itemset.
  (b) Add an itemset to the frequent itemsets if its support is greater than or equal to min_support.
  (c) At each level, divide each frequent itemset into a left hand side and a right hand side.
  (d) Calculate the confidence of each rule that is generated.
  (e) Generate strong rules satisfying min_support and min_confidence.
STEP 3: For each rule Rj in DB, do the following steps:
  (a) Compare Rj with all the rules RSi, i = 1, 2, ..., n.
  (b) If Rj matches, then remove Rj from DB; else include Rj in FRS.
STEP 4: Output the set of filtered rules, FRS.
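A compact sketch of the whole pipeline follows, assuming the frequent-pattern mining of STEP 2 is delegated to a simple Apriori-style helper and that the beliefs of STEP 3 arrive as (antecedent, consequent) pairs; function and variable names are illustrative, not from the paper.

```python
from itertools import combinations

def mine_rules(transactions, min_support, min_confidence):
    """STEP 2: generate strong rules satisfying min_support and min_confidence."""
    n = len(transactions)
    # Count the support of every itemset up to size 3 (kept tiny for illustration).
    counts = {}
    for t in transactions:
        for size in (1, 2, 3):
            for items in combinations(sorted(t), size):
                counts[items] = counts.get(items, 0) + 1
    frequent = {i: c / n for i, c in counts.items() if c / n >= min_support}
    rules = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue
        for k in range(1, len(itemset)):
            for lhs in combinations(itemset, k):  # split into LHS and RHS
                rhs = tuple(x for x in itemset if x not in lhs)
                conf = supp / frequent[lhs]       # subsets of a frequent set are frequent
                if conf >= min_confidence:
                    rules.append((lhs, rhs))
    return rules

def filter_rules(rules, belief_rules):
    """STEP 3: drop any rule matching an uninteresting belief rule."""
    return [r for r in rules if r not in belief_rules]

transactions = [{"fever", "flu"}, {"fever", "flu"}, {"fever", "malaria_test"}]
strong = mine_rules(transactions, min_support=0.5, min_confidence=0.8)
FRS = filter_rules(strong, belief_rules={(("fever",), ("flu",))})
print(FRS)  # STEP 4: the filtered rule set
```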

5. EXPERIMENTAL EVALUATION
A data set of 1000 clinical records having different attributes such as symptoms, age, sex, disease, tests and treatments was entered into the back-end database. When any symptom is searched, the proposed approach searches the back-end database, finds all the frequent diseases, tests and treatments associated with the searched symptoms, and filters out those patterns which do not satisfy the minimum support and minimum confidence thresholds. The set of discovered rules is large and contains some irrelevant rules.

The age and sex of a patient play an important role in the medical practitioner's decision-making process. According to medical experts' and doctors' knowledge/beliefs, certain medical tests are not applicable to certain age groups and sexes. We entered those rules in a SQL database, RS; a minimal sketch of this rule store is given below. The rules can be added to or removed from the database dynamically. The discovered patterns are compared with the rules in RS; if a pattern matches, it is considered irrelevant and filtered.

A portion of the rule set is shown in Table 1.

Table 1. Portion of the Rule Set

ID | Test | Age Group / Sex
1 | Polio tests | Above 50
2 | Pelvic area | Male
3 | Testosterone replacement therapy (TRT) | Female
4 | Testosterone replacement therapy (TRT) | 0-10
5 | .. | ..
6 | .. | ..

Fig. 2 shows a screenshot of the dynamic rule filtering technique.

Fig 2: Screenshot of Rule Filtering Technique.

The main parameters we considered for the analysis of different association rule mining approaches are scalability and user interest criteria. Table 2 shows the comparison of different association rule mining approaches.
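Since the belief rules of Table 1 are stored in a SQL database (RS), a minimal sketch of that table and the matching query might look as follows; the schema and column names are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE RS (id INTEGER, test TEXT, age_group_or_sex TEXT)")
conn.executemany("INSERT INTO RS VALUES (?, ?, ?)", [
    (1, "Polio Tests", "Above 50"),
    (2, "Pelvic area", "Male"),
    (3, "Testosterone replacement therapy (TRT)", "Female"),
    (4, "Testosterone replacement therapy (TRT)", "0-10"),
])

def is_irrelevant(test, group):
    """A discovered pattern is filtered out if (test, group) appears in RS."""
    cur = conn.execute(
        "SELECT 1 FROM RS WHERE test = ? AND age_group_or_sex = ?", (test, group))
    return cur.fetchone() is not None

print(is_irrelevant("Pelvic area", "Male"))   # True  -> prune this pattern
print(is_irrelevant("Blood test", "Female"))  # False -> keep
```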


Table 2. Comparison of various rule mining algorithms

Parameter | CLOSET Algorithm | GART Algorithm | Proposed Algorithm
Scalability | Yes | Yes | Yes
User interesting criteria | No | No | Yes

Using simulation, we are able to obtain efficient results in terms of the number of rules filtered as compared to the classical approach. Table 3 shows the comparison of the number of interesting rules filtered when the second filter is applied and when it is not applied. Each row corresponds to a different set of symptoms, given to filter a different set of uninteresting rules. Our experimental evaluation shows that the rules are generated and that the filtered rules are uninteresting to the user.

Table 3. Number of rules filtered

No. | Classical approach: no. of rules filtered | Proposed approach: rules filtered, second filter not applied | Proposed approach: rules filtered, second filter applied
1 | 978 | 850 | 535
2 | 780 | 600 | 255
3 | 855 | 436 | 198
4 | 926 | 742 | 511

6. CONCLUSION
This paper discusses the concept of interestingness measures and their usage. It also proposes and implements a new dynamic rule filtering technique which uses user beliefs as a second filter to remove irrelevant rules. In the proposed technique, we identify interesting rules indirectly by the elimination of uninteresting rules. The effectiveness of this approach lies in its ability to quickly eliminate large families of rules that are not interesting.

7. REFERENCES
[1] Gregory Piatetsky-Shapiro, "Discovery, analysis and presentation of strong rules", AAAI Press, pp. 229-248, Menlo Park, CA, 1991.
[2] Hoschka and Klösgen, "A Support System for Interpreting Statistical Data", AAAI Press, pp. 325-345, Menlo Park, CA, 1991.
[3] B. Liu, W. Hsu, and S. Chen, "Using general impressions to analyze discovered classification rules", in Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (KDD '97), pp. 31-36, 1997.
[4] Q. Yang, J. Yin, C. X. Ling, and T. Chen, "Postprocessing decision trees to extract actionable knowledge", in Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM '03), pp. 685-688, November 2003.
[5] P. Schrodt, "Forecasting conflict in the Balkans using hidden Markov models", in Programming for Peace, T. Robert, Ed., pp. 161-184, Springer, 2006.
[6] Z. Ras and A. Tzacheva, "In search for action rules of the lowest cost", in Monitoring, Security, and Rescue Techniques in Multiagent Systems, B. Dunin-Keplicz, A. Jankowski, A. Skowron, and M. Szczuka, Eds., pp. 261-272, Springer, 2005.
[7] Z. W. Ras, L.-S. Tsay, A. A. Tzacheva, and O. Gurdal, "Mining for interesting action rules", in Proceedings of the IEEE/WIC/ACM International Conference on Intelligent Agent Technology (IAT '05), pp. 187-193, September 2005.
[8] K. Wang, S. Zhou, and Y. He, "Growing decision trees on support-less association rules", in Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '00), pp. 265-269, August 2000.
[9] A. A. Tzacheva and L.-S. Tsay, "Tree-based construction of low-cost action rules", Fundamenta Informaticae, vol. 86, no. 1-2, pp. 213-225, 2008.
[10] Z. He, X. Xu, S. Deng, and R. Ma, "Mining action rules from scratch", Expert Systems with Applications, vol. 29, no. 3, pp. 691-699, 2005.
[11] Z. W. Ras, L.-S. Tsay, A. Dardzinska, and H. Wasyluk, "Association action rules", in Proceedings of the IEEE International Conference on Data Mining Workshops, pp. 283-290, December 2008.
[12] Z. W. Ras and A. Dardzinska, "Action rules discovery without pre-existing classification rules", in Rough Sets and Current Trends in Computing, vol. 5306 of Lecture Notes in Computer Science, pp. 181-190, Springer, 2008.
[13] U. Fayyad and R. Uthurusamy, "Evolving data mining into solutions for insights", Commun. ACM 45(8), pp. 28-31, 2002.
[14] A. Silberschatz and A. Tuzhilin, "What makes patterns interesting in knowledge discovery systems", IEEE Trans. Knowledge and Data Eng., pp. 970-974, 1996.
[15] A. A. Freitas, "On objective measures of rule surprisingness", Proc. Second European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD-98), pp. 1-9, 1998.
[16] R. Meo, "Theory of dependence values", ACM Trans. Database Syst. 25: 380-406, 2000.
[17] B. Shekar and R. Natarajan, "A transaction-based neighbourhood-driven approach to quantifying interestingness of association rules", Proc. Fourth IEEE Int. Conf. on Data Mining (ICDM), pp. 194-201, 2004.


A Critique on Quality of Service in MANET


P. Rajeswari
Cauvery College for Women
Tiruchirapalli
rajebala19@gmail.com

Dr. T. N. Ravi
Periyar E.V.R College (Autonomous)
Tiruchirapalli
proftnravi@gmail.com

ABSTRACT
Mobile Ad Hoc networks (MANETs) consist of a collection of autonomous wireless nodes that exchange data among themselves dynamically. These networks are characterized by a highly versatile network topology, high node mobility, low channel bandwidth and limited battery power. Because of the rising popularity of multimedia applications and the potential commercial usage of MANETs, QoS support in MANET has become an unavoidable task. This article focuses on the basic concepts of QoS, the metrics used, the challenges, and an overview of QoS routing metrics and design considerations.

Keywords
MANET, QoS, Routing, Constraints, Multimedia, PH-AODV, AODV.

1. INTRODUCTION
MANET is a re-configurable wireless network which has a transient infrastructure. Such networks are self-created and self-organized. Mobile nodes are characterized by limited memory, limited power and lightweight features. Moreover, the nodes in a MANET can join or leave the network anytime, and hence links in a route may become temporarily unavailable, making the route invalid. Nodes that are in transmission range of each other are called neighbors. Neighbors can communicate directly with each other. However, when a node needs to send data to a non-neighboring node, the data is routed through a sequence of multiple hops, and the intermediate nodes act as routers.

Figure 1: Structure of MANET

Table 1: Characteristics and Challenges of MANET

Characteristics | Challenges
Can be rapidly organized | Packet losses due to transmission errors
Do not rely on pre-existing infrastructure | Mobility
Adapt to the traffic and mobility patterns | Battery constraints
Lightweight & autonomous | Ease of interfering on wireless transmissions

1.1 Issues
[Ste et al 04] has stated the following issues that can occur in MANET:

1. Unpredictability of environment: Ad hoc networks may be deployed in unknown terrains, hazardous conditions, and hostile environments where tampering or the actual destruction of a node may occur. Depending on the environment, node failures may occur frequently.

2. Unreliability of wireless medium: Communication through the wireless medium is unreliable and subject to errors. The quality of the wireless link may be unpredictable due to varying environmental conditions. Since the nodes are resource-constrained, it is very difficult to have reliable communication.

3. Resource-constrained nodes: Nodes in a MANET are fully battery powered. The battery may drain quickly, as battery power is used to transmit and receive information. Nodes also have limited storage and processing capabilities.

4. Dynamic topology: The topology changes frequently due to the mobility of nodes. As a result, new links between nodes are created and some existing links break.

As a result of these issues, MANET is prone to various types of faults, which include:

1. Transmission errors: Occur due to the unreliability of the wireless medium and the unpredictability of the environment.

2. Node failures: Happen due to different types of harmful conditions in the environment. A node drops out of the network either voluntarily or when its energy is drained.

3. Link failures: Failure of nodes, as well as changing environmental conditions, may cause links between nodes to break.


4. Route breakages: Due to node/link failures and/or node/link additions to the network, the network topology changes. Thus a route becomes out of date and incorrect. Packets forwarded through stale routes may either eventually be dropped or be delayed, depending on the network transport protocol.

5. Congested nodes or links: Certain nodes or links may become over-utilized due to the dynamic topology of the network and the nature of the routing protocol, leading to either larger delays or packet loss.

The remainder of the paper is organized as follows: Section 2 narrates the outline of QoS, Section 3 briefs about QoS routing protocols, Section 4 deals with the review of literature and Section 5 provides the conclusion.

2. QUALITY OF SERVICE
2.1 QoS Definition
[Mas] QoS for a network is measured in terms of the guaranteed amount of data that is transferred from one place to another during a certain time. QoS is identified as a set of measurable pre-specified service requirements such as delay, bandwidth, probability of packet loss, and delay variance (jitter). [San et al 12] The goal of QoS is to achieve more deterministic network behavior, so that the information carried by the network can be better delivered and network resources can be better utilized.

2.2 Why is QoS needed and difficult?
MANET has special features like autonomous architecture, distributed operation, multi-hop routing, reconfigurable topology, fluctuating link capacity, and lightweight terminals. Certain issues such as security, routing, reliability, internetworking, and power consumption have to be considered while designing a MANET, because of the shared nature of the wireless medium, node mobility, and battery limitations.

Due to the rapid expansion of multimedia technology and real-time applications, QoS is needed. To support QoS, link state information such as delay, bandwidth, cost, loss rate and error rate in the network should be available and manageable. However, obtaining and managing link state information in MANET is very difficult because the quality of a wireless link is subject to change with the surrounding circumstances. Due to the bandwidth constraint and dynamic topology of MANET, supporting QoS for the delivery of real-time communications in MANET is a challenging task.

2.3 Challenges
Providing better QoS in MANET is challenging due to the following: [Laj et al 07]

Node Mobility: In MANET, the topology is highly dynamic in nature due to the mobility of nodes. Since topologies are short-lived, routing overhead is high, as frequent updating has to be carried out to allow data packets to be routed to their destinations. Also, due to node mobility, packet loss increases, which in turn affects the end-to-end delay.

Lack of Central Control: MANETs can be set up at any place and time without any pre-existing infrastructure, and provide access to information and services regardless of geographic position. This makes it difficult to have any centralized control. Hence the controlling activities have to be distributed among the nodes, which requires a lot of information, thus increasing the routing overhead.

Unreliable Wireless Channel: Due to interference from other transmissions, thermal noise etc., the wireless channel is prone to bit errors. This makes it impossible to provide hard packet delivery ratio or link longevity guarantees.

Channel Contention: In order to discover the network topology, nodes in a MANET must communicate on a common channel, which introduces the problems of interference and channel contention.

Limited Device Resources: Even though mobile devices are becoming increasingly powerful, such devices generally have less computational power, less memory, and a limited (battery) power supply, compared to devices such as desktop computers typically employed in wired networks. This factor has a major impact on the provision of QoS assurances.

2.4 QoS Constraints
[Aur 02] has identified the following QoS constraint types and also stated the various parameters that can be used in each category, as listed below:

Table 2: QoS constraints

Constraint type | Parameter used
Time constraints / Additive constraints | Delay, jitter and cost
Space constraints | System buffer
Frequency constraints / Concave / Convex constraints | Network/system bandwidth
Multiplicative constraints | Loss probability
Reliability constraints | Error rate

2.5 QoS Parameters
[Ama et al 13] has pointed out the various QoS parameters that are suitable for different areas:

Table 3: QoS parameters

Areas | Parameters used
Multimedia | Jitter, delay, bandwidth
Military | Security
Networks (Routing) | Responsiveness, throughput, capacity, propagation delay, round trip time, delay variation (jitter), maximum transmission unit, bandwidth delay product

2.6 Single constrained vs Multi constrained QoS Metrics
In earlier days, most routing protocols concentrated on providing only the throughput QoS parameter, which was successful in many respects but does not always perform the best. In CEDAR, bandwidth was the only QoS parameter used in routing. But certain areas, like multimedia applications, require multiple QoS parameters such as jitter, delay and cost, which forced the trend to move from single constrained to multi constrained routing. The main function of multi constrained routing is to find a feasible path that satisfies multiple constraints simultaneously, which is a big challenge in MANET due to the dynamic topology (a feasibility check across constraint types is sketched below). Examples are QMRPD (QoS Multicast Routing Protocol for Dynamic group topology), GAMAN (Genetic Algorithm based routing for MANETs) and HMCOP (Heuristic Multi-constrained Optimal Path). [Sam et al 12]
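To make the single- vs multi-constraint distinction concrete, the sketch below checks a candidate path against one additive constraint (delay), one concave constraint (bandwidth) and one multiplicative constraint (loss probability), mirroring the constraint types of Table 2; the link metrics and thresholds are illustrative assumptions, not values from any of the cited protocols.

```python
# Each link on a path carries (delay_ms, bandwidth_mbps, loss_prob).
path = [(10, 5.0, 0.01), (25, 2.0, 0.02), (15, 8.0, 0.01)]

def feasible(path, max_delay_ms, min_bandwidth_mbps, max_loss_prob):
    delay = sum(d for d, _, _ in path)        # additive metric: sums along the path
    bandwidth = min(b for _, b, _ in path)    # concave metric: bottleneck link decides
    success = 1.0
    for _, _, loss in path:                   # multiplicative metric: losses compound
        success *= (1.0 - loss)
    return (delay <= max_delay_ms
            and bandwidth >= min_bandwidth_mbps
            and (1.0 - success) <= max_loss_prob)

print(feasible(path, max_delay_ms=60, min_bandwidth_mbps=1.0, max_loss_prob=0.05))  # True
```

A single-constrained protocol evaluates only one of these three checks; a multi-constrained protocol must satisfy all of them at once, which is what makes path selection hard under a dynamic topology.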


2.7 Hard QoS vs Soft QoS approach
If the QoS requirements of a connection are guaranteed to be met for the entire session, the approach is termed a hard QoS approach. In MANETs, it is very difficult to provide hard QoS due to the limited bandwidth, energy constraints and dynamic topology. If the QoS requirements are not guaranteed for the whole session, the approach is termed a soft QoS approach, and most protocols provide only soft QoS.

3. QoS ROUTING
QoS routing is a routing process that guarantees to support a set of QoS parameters while establishing a route. QoS routing in MANET needs to support multimedia real-time communication like video-on-demand, news-on-demand, web browsing, traveler information systems etc. These applications require a QoS guarantee not only over a single hop, but over the entire wireless multi-hop path. [See 12]

3.1 Goals of QoS Routing
The goal of QoS routing is twofold: a) selecting network paths that have sufficient resources to meet the QoS requirements, and b) achieving global efficiency in resource utilization. [Shi et al 99]

3.2 Problems of QoS Routing
The following are the problems faced during QoS routing in MANET according to [Kui et al 12]:
1. Due to the limited bandwidth in MANETs, the overhead of QoS routing is high, because the mobile host must have mechanisms to store and update link state information. The benefit of QoS routing has to be balanced against the bandwidth consumption in MANET.
2. Because of the dynamic nature of MANET, maintaining precise link state information is very difficult.
3. A feasible path, once established, may no longer exist, due to the mobility or power depletion of the mobile hosts. Hence QoS routing should quickly find a feasible new route to recover the service.

3.3 Challenges in Ad hoc QoS routing
According to [Aur 02], the challenges found in ad hoc QoS routing are:

Admission control - making admission decisions with time-varying link capacity.

Resource reservation - guaranteeing the availability of the reserved bandwidth over a shared medium.

End-to-end delay guarantee - how to measure end-to-end delay in an unsynchronized network, and how to detect delay violations.

Route failure detection and recovery - detecting a route break via a lost neighbor is too slow; is there anything better than re-discovery (AODV, DSR)?

Low control overhead - explicitly releasing the reserved resources each time a flow changes its route.

The other important problems in MANET when providing QoS are:

1. Routing problem: finding a loop-free route from the source to the destination with the requested level of QoS. Route selection strategies include power awareness, level of signal strength, link stability and the shortest path.

2. Maintenance problem: when the network topology changes, new routes, if available, have to be chosen or found quickly, in such a manner that the existing QoS requirements are supported.

3. Variable resource problem: whenever there is a change in the route or in link characteristics, how to react to changes in the available resources within the route.

3.4 Layered Architecture of QoS
According to [See et al 12], the layered view/architecture of quality of service contains three parts (User, Application and Network), as given in Figure 2.

Figure 2: Layered view of QoS

3.5 Metrics used to specify QoS Requirements
[Laj et al 07] The following is a sample of the metrics commonly used by applications to specify QoS requirements to the routing protocol. Consequently, they may be used as constraints on route discovery and selection. A worked example of the jitter calculation follows this list.

Minimum Required Throughput or Capacity (b/s): the desired application data throughput.

Maximum Tolerable Delay (s): usually defined as the maximum tolerable end-to-end (source to destination) delay for data packets.

Maximum Tolerable Delay Jitter: the difference between the upper bound on end-to-end delay and the absolute minimum delay. The former incorporates the queuing delay at each node, and the latter is determined by the propagation delay and the transmission time of a packet. The transmission time between two nodes is simply the packet size in bits divided by the channel capacity. This metric can also be expressed as delay variance.

Maximum Tolerable Packet Loss Ratio (PLR): the acceptable percentage of total packets sent which are not received by the transport layer at the destination node.
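As a worked example of the delay-jitter definition above, assuming a 1500-byte packet and per-hop figures chosen purely for illustration:

```python
packet_bits = 1500 * 8            # packet size in bits
capacity_bps = 2_000_000          # assumed 2 Mb/s channel capacity

# Transmission time between two nodes = packet size in bits / channel capacity.
tx_time_s = packet_bits / capacity_bps              # 0.006 s per hop

hops = 3
propagation_s = 0.002 * hops                        # assumed propagation delay
min_delay_s = hops * tx_time_s + propagation_s      # absolute minimum delay: 0.024 s
queuing_s = 0.010 * hops                            # assumed queuing delay at each node
upper_bound_s = min_delay_s + queuing_s             # upper bound on end-to-end delay

jitter_s = upper_bound_s - min_delay_s              # delay jitter = 0.030 s
print(tx_time_s, min_delay_s, jitter_s)
```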
3.6 Metrics employed for Route Selection and Evaluation
The following metrics, suggested by [Laj et al 07] and [See et al 12], are employed in routing protocols for path evaluation and selection in order to improve all-round QoS or to meet the specific requirements of application data sessions.

Table 4: Metrics used for Selection and Evaluation

Layers | Metrics used for route selection | Protocol evaluation metrics
Network | Throughput, end-to-end delay, node buffer space, delay jitter, packet loss ratio, route lifetime, energy expended per packet | Bandwidth, jitter, latency, packet loss
Link and MAC | Delay, link reliability, link stability, node relative mobility | Normalized MAC load, MAC energy efficiency
Physical | Signal-to-interference ratio, bit error rate, node residual battery charge | -
Transport/Application | - | Session acceptance/blocking ratio, session completion/dropping ratio

3.7 Network Resources needed to provide QoS
A network resource is required to perform a task and is consumed during performance, as correctly pointed out by [Laj et al 07]. The following is a list of network resources:

Node computing time: while mobile devices are being manufactured with increasingly powerful processors, they are still limited in computing power. As communication protocols usually do not place a heavy burden on the processor, it is the least critical resource.

Node battery charge: this is the most critical resource, because if a node's battery is drained, it cannot function at all. Node failures can also cause network partitioning. Hence power-aware and energy-efficient MAC and routing protocols have received a great deal of research attention.

Node buffer space (memory): during a network's operation, more than one node will be transmitting at the same time, or there may be no known route to another device. In either of these cases data packets must be buffered. Furthermore, when the buffers are full, any newly arriving packets must be dropped, thus contributing to packet loss.

Channel capacity: measured in bps; it affects data throughput and, indirectly, delay, and hence a host of other metrics too.

4. REVIEW OF LITERATURE
4.1 Multimedia based QoS
[Win et al 14] developed an Enhanced AODV for providing QoS of multimedia applications in MANET. It aims at obtaining the most relevant path and also computes multiple disjoint alternate paths from source to destination. When the current active path breaks, the path with the second highest priority is used for packet transmission. The application content is classified into video, audio and data. Hop count and end-to-end delay are used as optimization parameters, and bandwidth is used as a constraint parameter.

[Gee et al 12] proposed an on-demand QoS routing scheme, the QoS-Based Dynamic Source Routing (QDSR) protocol. This scheme separates the data service into two groups: i) non-real-time data service, where bandwidth is not so sensitive and delay is allowed during transmission, and ii) real-time data service, where delay and bandwidth are very sensitive. It focuses on finding a route that satisfies the QoS requirements and has a better chance of survival over a period of time, even after node movement.

Table 5: Comparative table for Multimedia based QoS

Algorithm | Parameters considered | Simulation result parameters discussed | Outcome
Enhanced AODV for providing QoS of Multimedia application in MANET | Hop count, end-to-end delay, bandwidth | Packet delivery ratio, routing overhead for audio, video and data | Reduced processing overhead and complexity
A QoS-Based DSR for Supporting Multimedia Services in Mobile Ad Hoc Networks | Bandwidth, link delay, signal strength | Average throughput, packet delivery ratio, lower packet loss rate and lower routing overhead | Found the most feasible route; selected the most stable link, thus reducing route maintenance

4.2 Routing based QoS
[Ash et al 14] developed the power hop based Ad-Hoc On-demand Distance Vector (PH-AODV) protocol, which uses node power and hop count parameters to select the best routing path. This method checks the source route table to find a valid route if one exists, or otherwise starts a route discovery process. It aims to achieve better throughput, better average end-to-end delay and fewer dropped packets on average.

[Ila et al 13] proposed Quality of Service (QoS) Routing in Mobile Ad-hoc Network (MANET) using AODV protocol: Cross-Layer Approach, which considers several constraints for route selection instead of considering only hop count, to attain reliable data transmission.

[Gov et al 13] focused on the improvement of QoS assurance in MANET by QSIG-AODV, a QoS based signaling and on-demand routing algorithm. The network layer makes use of in-band signaling and the AODV routing protocol to find the best path satisfying the QoS requirements. The concept is divided into two parts: i) QoS signaling, responsible for fast resource reservation and restoration, and ii) QoS routing, to provide optimal routing based on the AODV protocol. Table 6 summarizes these approaches.


Table 6: Comparative table for Routing based QoS

Algorithm | Concept | Outcomes
PH-AODV routing algorithm | Considers node power and hop count | Average throughput, average end-to-end delay and average dropped packets
Quality of Service (QoS) Routing in Mobile Ad-hoc Network (MANET) using AODV protocol: Cross-Layer Approach | Uses a cross-layer approach to improve information sharing between the network and physical layers | Packet delivery ratio, delay and packet drop
Improvement of QoS assurance in MANET by QSIG-AODV: a QoS based signaling and on demand routing algorithm | Makes use of a QoS in-band signaling mechanism and the AODV protocol | Increased packet delivery ratio, negligible congestion and end-to-end delay
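The PH-AODV idea summarized in Table 6, choosing among candidate routes by node power and hop count, can be sketched as a simple scoring rule; the weighting scheme below is an illustrative assumption, not the published algorithm.

```python
# Candidate routes: (hop_count, min_residual_node_power_fraction along the route).
routes = [(3, 0.40), (5, 0.90), (4, 0.75)]

def score(route, w_power=0.6, w_hops=0.4):
    """Higher is better: prefer high residual power, penalize long paths."""
    hops, power = route
    return w_power * power - w_hops * (hops / 10.0)

best = max(routes, key=score)
print(best)  # (5, 0.9): the power-rich route wins despite having more hops
```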

5. CONCLUSION
MANET is expected to play a significant role in the development of wireless communication systems. Such networks are attractive because they can be rapidly deployed anywhere and anytime without the existence of fixed base stations and system administrators. Hence, mobile ad hoc networks must be able to provide the required quality of service for the delivery of real-time communications such as audio and video, which poses a number of different technical challenges and issues due to the dynamic nature, bandwidth restrictions, and limited processing capabilities of mobile nodes. Thus, to provide efficient quality of service in mobile ad hoc networks, there is a solid need to create new architectures and services for routine network controls. Mechanisms are needed to develop efficient routing procedures that reduce power consumption, make efficient use of the limited bandwidth and communication capacity, and extend battery life. This paper has concentrated on the basic concepts of QoS and algorithms related to multimedia and routing, especially in MANET.

6. REFERENCES
1. [Ama et al 13] Amandeep Kaur, Dr. Rajiv Mahajan, "Survey of QoS based routing protocols for MANETs", International Journal of Advanced Research in Computer Science and Software Engineering, Vol. 3, Issue 7, July 2013.
2. [Ash et al 14] Ashraf AbiEin, Jaihad Nader, "An Enhanced AODV Routing Protocol for MANETs", IJCSI International Journal of Computer Science Issues, ISSN (Print): 1694-0814, ISSN (Online): 1694-0784, Vol. 11, Issue 1, No. 1, January 2014.
3. [Aur 02] Aura Ganz, "Quality of Service Provision in Mobile Ad hoc Networks (MANET)", prepared for TACOM Seminar, January 14, 2002.
4. [Gee et al 12] B. Geetha, K. Madhavi, "A QoS-Based DSR for Supporting Multimedia Services in Mobile Ad Hoc Networks", International Journal of Advanced Research in Computer Science and Software Engineering, ISSN: 2277-128X, Vol. 2, Issue 4, July 2012.
5. [Gov et al 13] Govind Kumar Jha, Neeraj Kumar, S. P. Tripathi, "Improvement of QoS assurance in MANET by QSIG-AODV: a QoS based signaling and on demand routing algorithm", CSI Digital Research Center, 2013.
6. [Ila et al 13] G. Ilanchezhiapandian, P. Sheik Abdul Khader, "Quality of Service (QoS) Routing in Mobile Ad-hoc Network (MANET) using AODV protocol: Cross-Layer Approach", International Journal of Computer Applications, ISSN 0975-8887, 2nd National Conference on Future Computing, 2013.
7. [Kui et al 12] Kui Wu and Janelle Harms, "QoS Support in Mobile Ad Hoc Networks", University of Alberta, 2012.
8. [Laj et al 07] Lajos Hanzo and Rahim Tafazolli, University of Surrey, "A Survey of QoS Routing Solutions for Mobile Ad Hoc Networks", IEEE Communications Surveys & Tutorials, Vol. 9, No. 2, 2nd Quarter 2007.
9. [Mas] Masoumeh Karimi, "Quality of Service (QoS) Provisioning in Mobile Ad-Hoc Networks", Technological University of America (TUA), USA.
10. [San et al 12] Sanjeev Gangwar, Dr. Saurabh Pal and Dr. Krishan Kumar, "Mobile Ad Hoc Networks: A Comparative Study of QoS Routing Protocols", IJCSET, ISSN: 2231-0711, Vol. 2, Issue 1, pp. 771-775, Jan 2012.
11. [See et al 12] Seema, Dr. Yudhvir Singh, Mr. Vikas Siwach, "Quality of Service in MANET", International Journal of Innovations in Engineering and Technology (IJIET), ISSN: 2319-1058, Vol. 1, Issue 3, Oct 2012.
12. [Shi et al 99] Shigang Chen and Klara Nahrstedt, "Distributed Quality-of-Service Routing in Ad Hoc Networks", IEEE Journal on Selected Areas in Communications, Vol. 17, No. 8, August 1999.
13. [Ste et al 04] Stephen Mueller, Rose P. Tsang, and Dipak Ghosal, "Multipath Routing in Mobile Ad Hoc Networks: Issues and Challenges", Performance Tools and Applications to Networked Systems, Lecture Notes in Computer Science, Vol. 2965, pp. 209-234, 2004.
14. [Win et al 14] Wintwar Oo and Eiei Khin, "Enhanced AODV for providing QoS of Multimedia application in MANET", International Conference on Advances in Engineering and Technology (ICAET '2014), Singapore, March 2014.


A Hybrid Model for Face Recognition and Retrieval with Multi-scale Features using Auto Correlation Function

S. Prasath*
Ph.D Research Scholar
Department of Computer Science
Erode Arts & Science College (Autonomous)
Erode, Tamilnadu, India
softprasaths@gmail.com

Dr. S. Pannirselvam
Research Supervisor & Head
Department of Computer Science
Erode Arts & Science College (Autonomous)
Erode, Tamilnadu, India
pannirselvam08@gmail.com

ABSTRACT
Face recognition and retrieval is a widely used biometric application for identification and security purposes. Various methods have been proposed for face image recognition; each method has advantages as well as disadvantages, and the complexity of processing and other issues affect the performance of existing systems. This paper presents a hybrid model for face recognition and retrieval with multiscale features using an auto-correlation function. The feature set is generated based on fiducial point features, octagon features and texture features. Matching is then performed against various face images, and this method produces efficient results. Euclidean distance is used as the distance measure between two images. Images taken from the standard ORL image database are considered for the experimentation, and FRR and FAR have been computed for performance evaluation. The proposed approach is compared with existing methods and provides better results.

Keywords: LBP, GLCM, OCT, OCR, FPF, HFE, ACF, FRR, FAR.

1. INTRODUCTION
Face recognition and retrieval is an essential task in a face recognition system and is accomplished with various types of features in the face such as the eyes, eyebrows, nose and mouth. The octagon features in the face recognition system are considered with triangles and rectangles. A hybrid model for multiscale feature extraction has been proposed for the establishment of feature sets with various features such as fiducial point features, texture features and octagon features. Various types of face features which can be extracted using different methods are presented.

This work concerns the extraction of facial features such as the eyes, eyebrows and mouth from static images containing faces. In general, the term facial feature extraction is widely used when static images are the concern, and tracking refers to the process of continuously extracting and tracking the features from video sequences. The usage of retrieving facial features in image analysis is described. Based upon the extracted fiducial point features, an image retrieval system is proposed. A small illustration of the distance-based matching and FRR/FAR evaluation named in the abstract is given below.
The octagon features in face recognition system are B.Poon et al., [6] presented a PCA based method to
considered with triangle and rectangles. A hybrid model for build a face recognition system to identify the characteristics
multiscale feature extraction has been proposed in the with the number of training and test data is varied in noise,
establishment of feature sets with various features such as blurriness to increase the number of signature on images
fiducial point features, texture features and octagon features. increasing the recognition rate. However, the recognition rate
Various types of face features which can be extracted using is increase in the calculation of covariance matrix and less
different methods are presented. number of individuals is supposed to be recognized.
The extraction of facial features such as eyes, H.Wang et al., [7] developed an improved PCA face
eyebrows and mouth from static images contain faces. In recognition algorithm based on Discrete Wavelet Transform
general, the term facial feature extraction is widely used when and Support Vector Machines is presented. The face images to
static images are the concern and tracking is used for referring form low frequency sub images by extracting the low
the process of continuously extracting and tracking the frequency component. PCA method is used to obtain the
features from video sequences. The usage of retrieving facial characterizations of sub images are extracted and eigenvectors
features present in the image analysis is described. Based put into the SVM classifier for training and recognition. The
upon the extracted fiducial point features, an image retrieval experimental results indicate that algorithm reduces the
system is proposed. computational quantity has deduced a lot.
Kin-Man Lam et al., [8] developed a head boundary
is first located in a head-and shoulders image are discussed.
The approximate positions of the eyes are estimated by means


of average anthropometric measures. Corners, the salient features of the eyes, are detected and used to set the initial parameters of the eye templates. The corner positions are used to locate the templates in relation to the eye images accurately and to greatly reduce the processing time for the templates.
Yuille et al. [9] presented a model for the detection of face features using deformable templates. The face features are described by a parameterized template, and an energy function is defined which connects edges, peaks and valleys in the image intensity to the corresponding template properties. The template interacts dynamically with the input image by manipulating its parameter values to minimize the energy function.
Yunhui Liu et al. [10] presented a face normalization method based on the location of the eyes. The face is detected using a boosted cascade of simple Haar features. The method detects the position of and distance between the face and the eyes, along with their orientation properties.
S.T. Gandhe et al. [11] developed a contour-matching-based face recognition system. The shape of the contours is affected by the tilting of a face, though this effect is not examined; choosing the threshold values automatically according to the characteristics of the image remains an important problem, and the algorithm needs to be tested on databases with larger variation. The face similarity indicator is found to perform satisfactorily under adverse conditions of exposure, illumination, contrast variations and face pose.
S.T. Gandhe et al. [12] implemented different methods of face recognition, namely Principal Component Analysis and a cascaded Discrete Wavelet Transform. The feasibility of these algorithms for human face identification is presented through experimental investigation. In contour matching, the recognition rate and the recognition time per image are very high. The energy function of the Hopfield network is rescaled to avoid spurious states and improve the recognition rate, but the algorithm needs to be tested on large variations of pose.
Various works in the literature relevant to the proposed research have been analyzed. Even though various models and techniques have been developed, there are still many issues in face recognition that have not been rectified; to overcome them, it is necessary to propose a hybrid model for face recognition and retrieval with multiscale features that achieves better results and accuracy. From this analysis, various drawbacks of the existing systems are identified, and their merits and demerits are noted. Parameters such as computational cost, processing time and complexity of processing make the existing systems complex. To overcome these issues, a hybrid model is needed that is organized in a simplified way to accomplish face recognition and retrieval efficiently.

3. METHODOLOGY
In order to overcome the limitations of the existing methodology, a method for face recognition and retrieval with multiscale features is proposed.

3.1 Local Binary Pattern (LBP)
Texture is a term that characterizes the contextual property of an image [2]. A texture descriptor can characterize an image as a whole; alternatively, it can characterize an image locally at the micro level together with a global texture description at the macro level. The LBP method labels every pixel in the image by thresholding the eight neighbors of the pixel with the center pixel value: if a neighbor pixel value is less than the threshold, a value of 0 is assigned, otherwise 1.
The original LBP operator has been used extensively for texture discrimination, demonstrating excellent results and good robustness against rotation. The operator labels the pixels of an image by thresholding the 3x3 neighborhood of each pixel with the center value and interpreting the result as a binary number; the histogram of the labels can be used as a texture descriptor [1] (a Python-style sketch of this labeling follows Fig.1.1 below). The limitation of the fundamental LBP operator is its small neighborhood, which cannot capture dominant features with large-scale structures.

3.2 Gray Level Co-occurrence Matrix
One of the simplest approaches for describing texture is to use statistical moments of the intensity histogram of an image. A statistical approach such as the co-occurrence matrix provides valuable information about the relative positions of neighbouring pixels in an image. The Gray Level Co-occurrence Matrix, developed by Haralick, is a tabulation of how often different combinations of pixel brightness values occur in an image.

3.3 Proposed Hybrid Model
The proposed hybrid model with multiscale feature extraction is the combination of the auto correlation function, the Fiducial Point Feature (FPF) and the octagon model. Each technique constituting the proposed model is used for a different process: the FPF-based technique is used for segmentation of the face image in the generation of features, the auto correlation function is used for the extraction of texture features, and finally the octagon model is used to extract the triangle- and rectangle-based features of the face image.

[Fig.1.1 Process Flow of Proposed Hybrid Model: input image -> FFPF generation with fiducial point features (eye, eyebrow, nose, mouth), FTex generation using the auto correlation function, FOCT generation (rectangle, triangle); FHFE = {FTex, FFPF, FOCT} -> image matching -> image retrieval]
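As a concrete illustration of the 3x3 LBP labeling described in Section 3.1, the following minimal Python sketch (assuming NumPy; the function names are illustrative and this is not the authors' implementation) computes the basic codes and the histogram descriptor:

import numpy as np

def lbp_image(img):
    # Label each interior pixel with the basic 3x3 LBP code:
    # a neighbor >= the center contributes a 1 bit, otherwise 0;
    # the eight bits are packed clockwise into a 0-255 code.
    img = np.asarray(img, dtype=np.int32)
    h, w = img.shape
    codes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    center = img[1:-1, 1:-1]
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= (neighbor >= center).astype(np.uint8) << bit
    return codes

def lbp_histogram(img):
    # The 256-bin histogram of the codes is the texture descriptor.
    hist, _ = np.histogram(lbp_image(img), bins=256, range=(0, 256))
    return hist / hist.sum()

The histogram returned by lbp_histogram() is the descriptor referred to in [1]; any gray-scale array can be passed in for experimentation.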


The above Fig.1.1 shows the process flow of the proposed face recognition and retrieval process with the generated multiscale features of the face. The feature set generated by the proposed model makes recognition and retrieval efficient. In order to overcome the issues in the existing systems, the proposed hybrid feature extraction model includes the maximum number of features and provides better results.
The proposed model is built from three sets of features, presented below:
- Estimation of octagon-based features.
- Extraction of texture features using the auto correlation function.
- Extraction of fiducial point features.
The above-mentioned features are extracted during the generation of the feature set using the techniques most suitable for each feature type, as presented in the following sections.

3.3.1 Octagon Feature
An octagon is one of the shapes of polygon: a closed plane shape enclosed by eight straight sides of equal length, with eight equal angles, eight vertices and eight edges. After extracting the edges from the input image, the closed edges are considered for the extraction of the edge orientation.

[Fig. 1.2 Octagon with edges]

F_OCE = {F_OCE1, F_OCE2, F_OCE3, F_OCE4, F_OCE5, F_OCE6, F_OCE7, F_OCE8}   (3.1)
// F - Face, OCE - Octagon Edges //

First the regular octagon is divided into parts to find the area:
Area = {OCE1, OCE2, OCE3, OCE4, OCE5, OCE6, OCE7, OCE8}
The area of the regular octagon is identified and compared with the triangle and rectangle methods. To find the area of the octagon, all the areas of the interior shapes are summed: the octagon decomposes into 4 rectangle shapes and 4 triangle shapes, which are identified and computed.
To calculate the length m:  m = s / sqrt(2)
To calculate the interior shape:  center = s^2

[Fig.1.3 Octagonal Features of Triangle and Rectangle]

In this step the triangles and rectangles are converted into the octagonal-shaped pixel regions. Feature selection based on the edges of the face image is an interesting topic in the research community, and the outer edges are considered for extraction. The estimated feature set is given in the following equation 3.2:
F_OCA = F_OCC + F_OCT + F_OCR   (3.2)   // OCA - Octagon Area //
where Area = Center + Triangle + Rectangle.
The area of one of the four triangles:
Area of triangle = (1/2) b.h = (1/2) m.m = (1/2) m^2
The area of one of the four rectangles:
Area of rectangle = s.m
With side length s, the area of the entire regular octagon is generated by summing the smaller areas within the octagon (a numeric check of this decomposition follows Fig.1.4 below):
F_OCA = s^2 + 4 ((1/2) m^2) + 4 (s.m)
      = s^2 + 4 ((1/2) (s/sqrt(2))^2) + 4 (s . s/sqrt(2))
      = 2 s^2 (1 + sqrt(2))
The rectangle and triangle shapes of the octagon are identified and computed as represented in the following equations:
F_OCT = {F_OCT1 + F_OCT2 + F_OCT3 + F_OCT4}
F_OCR = {F_OCR1 + F_OCR2 + F_OCR3 + F_OCR4}
Finally,
F_OCTF = {F_OCE, F_OCT, F_OCR}   (3.3)
The octagon-based features are extracted and tested on images collected from the standard face database ORL DB. Images of each size are considered for this experiment, as shown in Fig.1.4.

[Fig.1.4 Octagon features of images extraction: sample images 101_1, 102_1, 103_1, 104_1]
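The octagon-area decomposition derived above can be checked numerically. The following small Python sketch (function names are illustrative, not from the paper) sums the center square, the four triangles and the four rectangles and compares the result with the closed form 2 s^2 (1 + sqrt(2)):

import math

def octagon_area_by_parts(s):
    # Regular octagon of side s, summed from its interior shapes:
    # one s-by-s center square, four right triangles with legs
    # m = s / sqrt(2), and four s-by-m edge rectangles.
    m = s / math.sqrt(2)
    center = s * s
    triangles = 4 * (0.5 * m * m)
    rectangles = 4 * (s * m)
    return center + triangles + rectangles

def octagon_area_closed_form(s):
    # Equivalent closed form derived in the text.
    return 2 * s * s * (1 + math.sqrt(2))

s = 3.0
assert abs(octagon_area_by_parts(s) - octagon_area_closed_form(s)) < 1e-9
print(octagon_area_by_parts(s))  # about 43.456 for s = 3

The assertion confirms that the part-by-part sum and the closed form agree, which is the identity used by the octagon feature computation.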


In the proposed work a multiscale feature set has been established with three different types of feature sets. The fiducial point features such as the eye, eyebrow, nose and mouth are extracted; then the texture-based features are extracted with the auto-correlation function [PANN09]; and finally, the octagon-based features of the face images, such as the number of triangles and rectangles, are extracted with the octagonal algorithm. The three extracted feature sets are concatenated together to generate the multiscale features for better face recognition and retrieval, as presented in the following sections.

3.3.2 Auto Correlation Function
In statistics, the auto correlation function of a random process describes the correlation between the process at different points in time.

3.3.2.1 Feature Extraction with Auto Correlation Coefficient
This section describes the extraction of the feature vectors of the input image with the auto correlation function. The gray scale input image Ii of size (M x N) from the image database is considered for the extraction of feature vectors. Let the input image Ii be divided into the K sub-images S1, S2, ..., Sk, each of size (n x n), where n <= M, N, using the equation
k = 2^(2(n-1))   (3.4)
where n = 1, 2, 3, ... The feature vectors are then extracted for each of the k non-overlapped sub-regions.
The auto correlation coefficients, which range from 0 to 1, are computed in the (i,j) directions with the positional difference (p,q) of each block using the following equation 3.5 (a Python-style sketch of this computation follows Section 4.3 below):

C_ff(p,q) = [ n^2 * sum_{i=1..n-p} sum_{j=1..n-q} f(i,j) f(i+p, j+q) ] / [ (n-p)(n-q) * sum_{i=1..n} sum_{j=1..n} f^2(i,j) ]   (3.5)

where p <= n/2, q <= n/2 and n is the size of the Kth block.
The obtained auto correlation coefficient values of the Kth block are converted into autonums with values in the range 0 to 100 and are arranged in the feature matrix shown below:

Fk = [ C_ff(0,0)    C_ff(0,1)    ...  C_ff(0,q-1)
       C_ff(1,0)    C_ff(1,1)    ...  C_ff(1,q-1)
       ...          ...          ...  ...
       C_ff(p-1,0)  C_ff(p-1,1)  ...  C_ff(p-1,q-1) ]   (3.6)

The values obtained in the feature matrix of each block are considered as the auto correlation features.
Feature extraction involves simplifying the amount of resources required to describe a large set of data accurately. When analyzing complex data, one of the major problems arises from the number of variables involved: analysis with a large number of variables generally requires a large amount of memory and high computational cost, or a classification algorithm that overfits the training sample and generalizes poorly to new samples. To generate the feature vectors, some statistical measures are used in the computation.

4. Generation of Feature Set
The facial features play an important role in face image classification, recognition and retrieval systems. This section discusses the extraction of the facial features considered. The features used as fiducial points are extracted separately and matched.

4.1 Generation of Fiducial Point Feature
The segmented facial image is considered for the feature extraction. The features used to establish the face fiducial-point-based feature set are the eye, eyebrow, nose and mouth. Here, the facial fiducial point feature set generation (FPF) is considered for the image recognition and retrieval. The features are estimated and stored in a feature set as in the following equation 3.7:
F_FPF = {F_EB, F_E, F_M, F_N}   (3.7)
The feature set established for face image recognition and retrieval is:
F_FPF = {12,14,148,170, 12,14,149,149, 12,14,162,199, 12,14,158,121, 12,14,119,29, 12,14,164,168}
The extracted feature set block of the target image is obtained. These procedures are repeated for all the blocks to obtain the feature set for each image in the IDB.

4.2 Generation of Feature Set with Texture
In order to use the gray values instead of binary values, the texture feature is used in the generation of the feature set. With the auto correlation function, the feature set FTex of the input image is established and represented as below:
F_Tex = {F_1, F_2, ..., F_k}   (3.8)
where F_1, F_2, ..., F_k denote the auto correlation feature vectors of the k blocks.
In addition to the above features, the generated feature sets (FTex) of all the textured images in the image database are calculated and stored in the feature database. Hence, the obtained autocorrelation feature values of each block are represented, and the feature vectors are obtained with the values as mentioned in equation 3.8. The extracted feature set values of the image are obtained as:
F_Tex = {1.000, 0.999, 0.998, 0.996, 0.971, 0.974, 0.977, 0.980, 0.945, 0.946, 0.948, 0.950, 0.923, 0.92, 0.927, 0.929}
Here, the auto correlation function feature set generation is considered for the recognition and retrieval. The extracted feature set block of the target image is obtained. These procedures are repeated for all the blocks to obtain the feature set for each image in the IDB.

4.3 Generation of Feature Set with Octagon
The feature set is generated with the octagonal face-image-based feature vectors for all the sub-images. The features of each block of the sub-image are represented as below:
F_OCTF = {F_OCE, F_OCT, F_OCR}   (3.9)
Hence, the obtained octagon feature values of each block are represented, and the feature vectors are obtained with the values as mentioned in equation 3.9. The extracted feature set values of the image are obtained as:
F_OCTF = {1.000, 0.998, 0.996, 0.992, 0.989, 0.992, 0.995, 0.997, 0.972, 0.976, 0.980, 0.983, 0.958, 0.961, 0.964, 0.967}
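The per-block computation of equation 3.5 can be sketched as follows in Python (assuming NumPy; autocorr_feature_matrix is a hypothetical helper corresponding to the feature matrix of equation 3.6, not the authors' code):

import numpy as np

def autocorr_coeff(block, p, q):
    # Normalized auto correlation coefficient of equation 3.5 for an
    # n-by-n block f with positional difference (p, q), p, q <= n/2.
    f = np.asarray(block, dtype=np.float64)
    n = f.shape[0]
    num = (n * n) * np.sum(f[:n - p, :n - q] * f[p:, q:])
    den = (n - p) * (n - q) * np.sum(f * f)
    return num / den

def autocorr_feature_matrix(block, pmax, qmax):
    # Feature matrix of equation 3.6: C_ff over all shifts (p, q).
    return np.array([[autocorr_coeff(block, p, q)
                      for q in range(qmax)] for p in range(pmax)])

Note that autocorr_coeff(block, 0, 0) is always 1, which matches the leading 1.000 entries in the example F_Tex values above.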


Here, the octagon feature set generation is considered for the recognition and retrieval. The extracted feature set of the block of the target image is obtained.

4.4 Generation of Feature Set with Proposed Methodology
The earlier sections discussed the generation of feature sets from the various features present in the face image, such as the eye, eyebrow, nose and mouth. Along with these features, the texture features present in the input image are used with the modified auto correlation function. The representation of the multiscale feature set is presented in equation 3.10:
F_MFE = {F_FPF, F_Tex, F_OCTF}   (3.10)
where {F_FPF} is the fiducial point feature set represented in equation 3.7, {F_Tex} is the texture feature set represented in equation 3.8, and {F_OCTF} is the octagon feature set represented in equation 3.9.
F_MFE = {F_EB, F_E, F_N, F_M, F_1, F_2, ..., F_K, F_OCT, F_OCR}   (3.11)
The extracted feature set block of the target image is obtained. These procedures are repeated for all blocks to obtain the feature set for each image in the IDB. The entire extraction process uses equation 3.11:
F_MFE = {12,14,148,170, 12,14,149,149, 12,14,162,199, 12,14,158,121, 12,14,119,29, 12,14,164,168, 1.000, 0.999, 0.998, 0.996, 0.971, 0.974, 0.977, 0.980, 0.945, 0.946, 0.948, 0.950, 0.923, 0.92, 0.927, 0.929, 1.000, 0.998, 0.996, 0.992, 0.989, 0.992, 0.995, 0.997, 0.972, 0.976, 0.980, 0.983, 0.958, 0.961, 0.964, 0.967}

5. Similarity and Performance Measures
To find the similarity between images, various metrics are used to measure the distance between the features of the images. Some of the well-known distance metrics used for image retrieval are presented below. In the texture-based image retrieval system, the Euclidean distance is used to find the distance between the feature vectors of the target image It and each image Ii in the image database. The difference between two images Ii and It can be expressed as the distance d between the respective feature vectors Fs(Ii) and Fs(It). For a given input image Ii and target image It, the Euclidean distance dE(Fs(Ii), Fs(It)) is calculated as:
dE(Fs(Ii), Fs(It)) = sqrt( sum_{i=1..n} (Fs(Ii) - Fs(It))^2 )   (3.12)
where Fs(Ii) is the feature set of the input image Ii and Fs(It) is the n-dimensional feature vector of the target image It.

5.1 False Acceptance Rate (FAR)
The probability that the system incorrectly matches the input image with images stored in the image database. The false acceptance rate is computed to evaluate the performance of the face image matching using the following equation 3.13:
FAR = No. of images accepted out of database / Total no. of images in database   (3.13)

5.2 False Rejection Rate (FRR)
The ratio of the number of correct images rejected in the database to the total number of images in the database. The false rejection rate is computed to evaluate the performance of the face image matching using the following equation 3.14:
FRR = No. of correct images rejected / Total no. of images in database   (3.14)

5.3 Image Retrieval Rate
The performance of image retrieval is measured in terms of precision (P) and recall (R), which are defined below:
P = r/n1 = number of relevant images retrieved / number of retrieved images   (3.15)
R = r/n2 = number of relevant images retrieved / total number of relevant images in IDB   (3.16)
The retrieval performance of the proposed scheme is also measured in terms of the average recognition rate by considering each image as a target image and listing the number of relevant images corresponding to the target images available in the database.

6. Algorithm
The entire retrieval procedure with the proposed hybrid model is presented as simple algorithms, in two phases. Algorithm-I establishes the feature set for each of the images; Algorithm-II presents the image retrieval procedure that retrieves the top M images from the IDB corresponding to the target image (a Python-style sketch of the two phases follows Algorithm-II below).

6.1 Algorithm-I
// generating feature sets //
Input: Input image
Output: Hybrid feature set
Begin
Step 1: Read an input image from the image database.
Step 2: Generate the Fiducial Point Feature using equation 3.7.
Step 3: Select the fiducial feature obtained in Step 2.
Step 4: Perform Procedure Exact_tex() on the image.
Step 5: Perform Procedure octagon() to generate the feature set.
Step 6: Generate the multiscale feature set from the feature sets obtained in Step 2 to Step 5.
Step 7: Repeat Step 1 to Step 6 for all images in the selected data set.
Step 8: Stop
End

6.2 Algorithm-II
// Retrieve top relevant images //
Input: Target image and images from IDB
Output: List of top M relevant images corresponding to the target image.
Begin
Step 1: Select the target image from the selected image dataset.
Step 2: Perform Step 1 to Step 6 of Algorithm-I to generate the feature set.
Step 3: Perform Procedure Eucli_Dist().
Step 4: Sort the computed Euclidean distances in ascending order.
Step 5: Retrieve the top M relevant face images.
End
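A minimal Python sketch of the two phases, assuming the feature sets are plain numeric vectors (the dictionary idb_features and the function names are illustrative, not part of the paper):

import numpy as np

def euclidean_distance(fs_i, fs_t):
    # Equation 3.12: distance between two n-dimensional feature sets.
    fs_i, fs_t = np.asarray(fs_i, float), np.asarray(fs_t, float)
    return float(np.sqrt(np.sum((fs_i - fs_t) ** 2)))

def retrieve_top_m(target_features, idb_features, m):
    # Algorithm-II: rank every IDB image by its distance to the
    # target feature set and return the top M image identifiers.
    ranked = sorted(idb_features.items(),
                    key=lambda kv: euclidean_distance(kv[1], target_features))
    return [name for name, _ in ranked[:m]]

def precision_recall(retrieved, relevant):
    # Equations 3.15 / 3.16: r relevant-and-retrieved over n1 retrieved,
    # and over n2 relevant images in the IDB.
    r = len(set(retrieved) & set(relevant))
    return r / len(retrieved), r / len(relevant)

Here each value of idb_features would be an F_MFE-style concatenated vector produced by an Algorithm-I-style extraction; sorting by ascending distance realizes Step 4 of Algorithm-II.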


// generating texture features //
Procedure Exact_tex()
{
Step 1: Read an input image from the image database.
Step 2: Partition the input image into K non-overlapped blocks.
Step 3: Perform Procedure auto_corr().
Step 4: Repeat Step 2 through Step 3 for all blocks of the input image.
Step 5: Generate feature set FTex as mentioned in equation 3.8.
Step 6: Store the feature set into the feature database.
Step 7: Repeat Step 1 through Step 6 for all the images in the IDB.
}
(A Python-style sketch of this procedure is given at the end of this section.)

// procedure octagon //
Procedure octagon()
{
Step 1: Read the image for the octagon.
Step 2: Initialize the edge orientation using equation 3.1.
Step 3: Bound the closed edges with a rectangle and triangle grid.
Step 4: Eliminate the edges present in a rectangle if its area is less than that of the bounded rectangle, and count the number of rectangles and triangles for feature extraction.
Step 5: Calculate the triangle- and rectangle-based general feature set as discussed in section 3.3.1.
Step 6: Calculate the octagon features of the image as mentioned in equation 3.3.
Step 7: Repeat Step 2 through Step 6 for all the input images and find the feature vectors of each triangle, edge and rectangle.
Step 8: Return.
}

// procedure Euclidean distance //
Procedure Eucli_Dist()
{
Step 1: Compute the distance measures between each image from the IDB and the target image using equation 3.12.
Step 2: Return
}

// procedure auto correlation //
Procedure auto_Corr()
{
Step 1: Read an input image from the image database.
Step 2: Compute the positional difference using equation 3.5.
Step 3: Establish the feature matrix with the auto correlation features for all the K sub-regions of the input image, as discussed in section 3.3.2.
Step 4: Return
}

7. Experimentation & Results
The proposed hybrid model with the auto correlation function is tested on images collected from the standard face database ORL DB. Images of each size are considered for this experiment; the original images are segmented with fiducial point features to generate feature sets, and the pixel values lie in the range 0-240. Some of the sample images considered in the experiments are presented in Fig.1.5.

[Fig.1.5 Sample Images considered for experimentation: 101_1, 102_1, 103_2, 104_1]

Each image in the IDB is subjected to the above-mentioned extraction process with the proposed model, and the proposed model coefficients are obtained as described in the previous sections. The procedures auto_corr(), octagon_feature() and FPF_feature() are applied, and the entire extraction process uses the values of equation 3.11. The proposed Algorithm 6.1 is applied to the images from the selected image database for the experimentation. The image 103_2 is considered as the input image. Initially the fiducial point features are extracted to generate feature sets. The resulting fiducial point feature set is generated for the input image 103_2, and the established face image recognition and retrieval feature set is as mentioned in equation 3.7:
F_FPF = {F_EB, F_E, F_M, F_N}
F_FPF = {12,14,148,170, 12,14,149,149, 12,14,162,199, 12,14,158,121, 12,14,119,29, 12,14,164,168}
After the execution of Step 2, each image in the IDB is subjected to the above-mentioned texture extraction process with the auto correlation model. The image under analysis is partitioned into k blocks, each of size (tm x tn). The auto correlated coefficients are then obtained as described in section 4.2, and the procedure auto_corr is applied; the feature vectors are obtained with the autonum values as mentioned in equation 3.5. For illustration, when the input image is 103_2 (shown in Fig.1.5), the auto correlation feature matrix (FTex) is extracted with the auto correlation model, the auto correlated coefficient being described in section 3.3.2. The resulting autocorrelation texture feature set is generated, and the feature vectors are obtained with the values as mentioned in equation 3.8:
F_Tex = {F_1, F_2, ..., F_k}
F_Tex = {1.000, 0.999, 0.998, 0.996, 0.971, 0.974, 0.977, 0.980, 0.945, 0.946, 0.948, 0.950, 0.923, 0.92, 0.927, 0.929}
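Procedure Exact_tex() above can be sketched in Python as follows (assuming NumPy; autocorr_feature stands for any per-block feature function, such as the C_ff sketch shown earlier, and the names are illustrative):

import numpy as np

def exact_tex(image, k_side, autocorr_feature):
    # Split the image into K = k_side**2 non-overlapped blocks and
    # collect one auto correlation feature per block into FTex
    # (equation 3.8).
    img = np.asarray(image, dtype=np.float64)
    h, w = img.shape
    bh, bw = h // k_side, w // k_side
    f_tex = []
    for r in range(k_side):
        for c in range(k_side):
            block = img[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
            f_tex.append(autocorr_feature(block))
    return f_tex

Repeating this over every image in the IDB and storing the results corresponds to Steps 6 and 7 of the procedure.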


Finally the octagon features, such as the number of triangles and the number of rectangles of the face, are computed as discussed in section 4.3. The resulting octagon feature set is generated for the input image, and the feature vectors are obtained with the values as mentioned in equation 3.9:
F_OCTF = {F_OCE, F_OCT, F_OCR}
F_OCTF = {1.000, 0.998, 0.996, 0.992, 0.989, 0.992, 0.995, 0.997, 0.972, 0.976, 0.980, 0.983, 0.958, 0.961, 0.964, 0.967}
The feature sets obtained from phases I, II and III are concatenated to generate the multiscale feature set for image recognition and retrieval. The master feature set values are obtained using equation 3.10:
F_MFE = {F_FPF, F_Tex, F_OCTF}
The values obtained for the multiscale feature set using equation 3.10 are as follows:
F_MFE = {12,14,148,170, 12,14,149,149, 12,14,162,199, 12,14,158,121, 12,14,119,29, 12,14,164,168, 1.000, 0.999, 0.998, 0.996, 0.971, 0.974, 0.977, 0.980, 0.945, 0.946, 0.948, 0.950, 0.923, 0.92, 0.927, 0.929, 1.000, 0.998, 0.996, 0.992, 0.989, 0.992, 0.995, 0.997, 0.972, 0.976, 0.980, 0.983, 0.958, 0.961, 0.964, 0.967}
After establishing the feature set of the input image, the feature set is generated for the target image given by the user for recognition and retrieval purposes. One such target image, 103_2, is shown in Fig.1.5, and the above-mentioned steps are performed for the target image. The generated feature set of the target image 103_2 is compared with the feature sets in the feature database using the Euclidean distance measure discussed in section 5. For image retrieval the proposed model is applied, the retrieval results obtained are evaluated in terms of precision and recall as presented in Table 1.2, and the images retrieved for the target image 103_2 are shown in the following Fig.1.6.

[Fig 1.6 Retrieval results obtained with the proposed method: target image 103_2; retrieved images 103_3, 103_7, 103_5, 103_6, 101_1, 103_1, 104_4, 103_9, 103_2, 103_8]

Fig.1.6 shows the experimental results of the proposed model. The image 103_2 is considered for matching: it matches 10 images of the same face, and 1 image of a different face is also matched. Similarly, in the case of retrieval, the same input image is matched against the image set of the same face; the input image itself is ranked at top 1, and the relevant images of the same face are listed within the top 10.
In order to evaluate the performance of the proposed system against the existing systems, the FAR and FRR are computed on the selected image sets. The performance evaluation is computed using equation 3.13 and equation 3.14. The results obtained for the face images with the proposed model are tabulated in the following Table 1.1, and the performance is evaluated against the existing LBP and GLCM models.


Table 1.1 Recognition Rate of Face images

  Model                   Acceptance %   Rejection %
  GLCM                    80.50%         19.50%
  LBP                     91.90%         8.50%
  FPF                     93.25%         6.75%
  Proposed Hybrid Model   95.50%         4.50%

As shown in Table 1.1, a total of 400 images was considered for the experimentation, with 10 face images of the same face under different variations. The performance of the proposed hybrid model compared to the existing models such as GLCM, LBP and FPF is shown in Table 1.1. It is observed from the table that the proposed model has a good recognition rate for the standard face images taken into consideration. An acceptance rate of 95.50% and a rejection rate of 4.50% are obtained for the proposed hybrid model, which is high compared to models such as FPF (acceptance rate 93.25%, rejection rate 6.75%) and LBP (91.90% and 8.50%), while GLCM provides an acceptance rate of 80.50% and a rejection rate of 19.50%. The pictorial representation of the experimentation and its performance evaluation is presented in Fig.1.7 below.

[Fig.1.7 Performance evaluation of proposed method]

Hence, the retrieval rate is estimated in terms of precision and recall. The precision and recall of the proposed method are presented in the following table; the precision and recall of the existing schemes are also obtained and, for comparison, incorporated in the same Table 1.2.

  Method                   Recall   Precision
  Proposed Hybrid Model    1.25     0.5
  Existing Method          1.10     0.3

Table 1.2 Comparison Result of Proposed Method with Existing Methods

Table 1.2 shows the precision and recall of the proposed model: the proposed model provides a recall of 1.25 and a precision of 0.5, while the AC model provides 1.10 and 0.3 respectively. Hence, the proposed model is also efficient for image retrieval. The pictorial representation of the retrieval performance is shown in the following chart.

[Fig.1.8 Performance evaluation of image retrieval]

8. CONCLUSION
In this paper, a hybrid model is proposed and applied to face recognition and retrieval. The estimated feature sets are generated as multiscale feature sets such as fiducial point features, texture-based features using the auto correlation function, and octagon features. The feature extraction process is very light while retaining the maximum number of features present in the face image. The experimental results show the efficiency of the proposed model: the acceptance and rejection rates are 95.50% and 4.50% respectively. The Euclidean distance has been computed for better recognition and retrieval. The proposed technique is tested and compared with the existing LBP, GLCM and FPF models; compared to the existing models, the proposed model gives better results and is efficient for recognition and retrieval.

9. REFERENCES
[1] Xiao-rong Pu, Yi Zhou, and Rui-yi Zhou, "Face recognition on partial and holistic LBP features", Journal of Electronic Science and Technology, vol. 10, no. 1, 2012.
[2] T. Ojala, M. Pietikainen, and D. Harwood, "A comparative study of texture measures with classification based on featured distribution", Pattern Recognition, vol. 29, no. 1, pp. 51-59, 1996.
[3] M. Karg, R. Jenke, W. Seiberl, K.K.A. Schwirtz and M. Buss, "A Comparison of PCA, KPCA and LDA for Feature Extraction to Recognize Affect in Gait Kinematics", 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops, pp. 1-6, 2009.
[4] O. Toygar and A. Acan, "Face Recognition Using PCA, LDA and ICA Approaches on Colored Images", Journal of Electrical & Electronic Engineering, vol. 3, pp. 735-743, 2003.
[5] F. Song, Z. Guo, and D. Mei, "Feature Selection Using Principal Component Analysis", International Conference on System Science, Engineering Design and Manufacturing Informatization, vol. 1, pp. 27-30, 2010.


[6] B. Poon, M.A. Amin, and H. Yan, "PCA Based Face Recognition and Testing Criteria", Proceedings of the Eighth International Conference on Machine Learning and Cybernetics, pp. 2945-2949, 2009.
[7] Wang, J., Wu, L., He, X., Tian, J., "A new method of illumination invariant face recognition", In Proc. of International Conference on Innovative Computing, 2007.
[8] Kin-Man Lam, Hong Yan, "Locating and extracting the eye in human face images", Elsevier Science Inc, pp. 771-779, 1996.
[9] Alan L. Yuille, Peter W. Hallinan and David S. Cohen, "Feature Extraction from Faces Using Deformable Templates", International Journal of Computer Vision, Kluwer Academic Publishers, vol. 8, Iss. 2, pp. 99-111, 1992.
[10] Yunhui Liu, Ganhua Li, Xuanping Cai, Xianshuai Li, "An efficient face normalization algorithm based on eyes detection", International Conference on Intelligent Robots and Systems, IEEE/RSJ, pp. 3843-3848, 2006.
[11] S.T. Gandhe, K.T. Talele, A.G. Keskar, "Intelligent Face Recognition Techniques", GVIP Journal, vol. 7, Iss. 2, pp. 53-60, 2007.
[12] S.T. Gandhe, K.T. Talele and A.G. Keskar, "Face Recognition using Contour Matching", International Journal of Computer Science, vol. 35, Iss. 2, pp. 226-233, 2008.


An analysis of Air pollutant (SO2) using Particle Swarm Optimization (PSO)

A. Vinayagam
Assistant Professor
Govt. Arts College, Karur
a_vinay@rediffmail.com

Dr. C. KAVITHA
Assistant Professor
Thiruvalluvar Govt. Arts College, Rasipuram

Dr. K. THANGADURAI
Assistant Professor
Govt. Arts College, Karur

ABSTRACT
An air pollutant is a substance in the air that can have adverse effects on humans. A pollutant can be natural or man-made, and harmful emissions in the air can severely affect human health. Forecasting models are essential for predicting air quality, and air quality and pollution monitoring services are provided by many countries. In this paper we describe our experience in developing a personal air pollution exposure estimation system using particle swarm optimization. SO2 emissions have been an international pollution factor because coal and petroleum often contain sulfur compounds. Here Particle Swarm Optimization (PSO) is used for analyzing world SO2 emissions based on global energy consumption. The experimental results show that PSO can provide good modeling results compared to other linear models.

General Terms
Sulphur Dioxide (SO2), Particle Swarm Optimization (PSO)

Keywords
Air Pollution, Right Fit

1. INTRODUCTION
Air pollution is caused by materials present in the air, such as biological molecules or other harmful materials, possibly causing disease or death in humans and damage to other living organisms. According to the WHO report, air pollution causes the death of millions of people worldwide [1]. Sulfur oxides (SOx), particularly sulfur dioxide (SO2), are produced by various industries; coal and petroleum combustion generates sulfur dioxide (SO2). Further oxidation of SO2, usually in the presence of a catalyst such as NO2, forms H2SO4, and thus acid rain [2]. This is one of the causes for concern over the environmental impact of the use of these fuels as power sources.
Sulfur dioxide (SO2) is a gas primarily emitted from fossil fuel combustion at power plants and other industries. Fuel combustion in mobile sources such as locomotives, ships, and other equipment also produces SO2. SO2 exposure leads to adverse impacts on the respiratory system. Recent reviews of the EPA standard have determined that even short-term exposure to high levels of SO2 can have a detrimental effect on breathing function, particularly for those who suffer from asthma. SO2 also reacts with other chemicals in the air to form tiny sulfate particles, contributing to levels of PM 2.5, and to form acids, which fall to the earth as acid rain. Acid rain causes damage to forests and crops.

1.1 Sources of Air pollution
The following factors are responsible for releasing pollutants into the atmosphere. They are classified into two main categories:

Anthropogenic (man-made) sources:
Burning of fuel results in the following:
- Stationary sources include smoke from power plants, manufacturing units and waste incinerators (units where waste is burned).
- Mobile sources include road transport vehicles, marine vessels, and aircraft engines.
- Fumes from paint, spray, varnish, aerosol and other solvents.
- Other sources, such as nuclear weapons and toxic gases.
- Point sources: large, stationary sources with relatively high emissions, such as electric power plants and refineries.
- Nonpoint sources: smaller stationary sources such as dry cleaners, gasoline service stations and residential wood burning; these may also include diffuse stationary sources such as wildfires and agricultural tilling.
- On-road vehicles: vehicles operated on highways, streets and roads.
- Non-road sources: off-road vehicles and portable equipment powered by internal combustion engines, including lawn and garden equipment, recreational equipment, construction equipment, aircraft and locomotives.

Natural sources:
- Dust from natural sources, usually large areas of land with little vegetation.
- Methane gas, which is emitted by animals during digestion.


2. PSO
Particle swarm optimization (PSO) is a method that optimizes a problem by iteratively improving candidate solutions. PSO optimizes a problem by having a population of candidate solutions, here dubbed particles, and moving these particles in the search space according to mathematical formulas over each particle's position and velocity. Each movement is influenced by the particle's local best-known position, but it is also guided toward the best-known positions in the space, which are updated as better positions are found by other particles. This is expected to move the swarm toward the best solutions.
PSO shares many similarities with evolutionary computation techniques such as Genetic Algorithms (GA). The system is initialized with a population of random solutions and searches for optima by updating generations. Unlike GA, PSO has no evolution operators such as crossover and mutation. In PSO, the potential solutions, called particles, fly through the problem space by following the current optimum particles.
Each particle keeps track of its coordinates in the problem space, which are associated with the best solution (right fit) it has achieved so far (the Right fit value is also stored); this value is called pbest. Another "best" value tracked by the particle swarm optimizer is the best value obtained so far by any particle in the neighborhood of the particle; this location is called lbest. When a particle takes the whole population as its topological neighbors, the best value is a global best and is called gbest. In PSO, every step changes the velocity of each particle toward its pbest and lbest locations. Acceleration is weighted by a random term, with separate random numbers generated for the acceleration toward the pbest and lbest locations.

3. PSO Algorithm
The PSO theory assumes that in a population of size M, each individual i, where 1 <= i <= M, has a present position Xi, an associated velocity Vi, and a local best solution li. In addition, each particle has a Right fit function, which is the particle's measure of merit and evaluates the adequacy of the particle's solution to the given problem. The particle's local best is the best solution obtained by it up to the current iteration. The global best solution Xg is the best solution among all the particles. Parameters c1 and c2 are the acceleration constants of the particles and influence the velocity at which the particles move in the search space. In each iteration cycle, the particle velocities are updated (accelerated) in the direction of the local and global minima:

Vi(t) = w Vi(t-1) + c1 r1 (li - Xi(t)) + c2 r2 (Xg - Xi(t))   (1)

The particles' positions are updated using simple discrete dynamics:

Xi(t) = Xi(t-1) + Vi(t)   (2)

where the term w is the inertia weight, used to balance the local and global search abilities of the algorithm by controlling the influence of previous information when updating the new velocity. Usually, w decreases linearly from 1 to approximately 0 during the process. The parameters r1 and r2 are generated from a uniform random sequence in the interval [0, 1].
In the algorithm, each individual in the population evolves according to its own experience, estimating its quality. Since the individuals are social beings, they also exchange information with their neighbors. These two types of information correspond to individual learning (cognitive: local knowledge) and cultural transmission (social: knowledge of the best position in the swarm), respectively. The algorithm [3] is formulated as follows (a code sketch follows the stop criteria below):

1. Initialize parameters such as the acceleration constants, inertia weight, number of particles, maximum number of iterations, velocity boundaries, initial and constrained velocities and positions, and eventually the error limit for the Right fit function.
2. Evaluate the particles' Right fit function values, comparing them with each other, thereby setting the local best and global best.
3. In accordance with Equations (1) and (2), calculate each particle's new velocity and position, and then update each particle.
4. For each particle, compare the current Right fit value with the local best. If the current value is better, update the local best Right fit value and particle position with the current one.
5. For each particle, compare the current Right fit value with the global best. If the current value is better, update the global best Right fit value and particle position with the current one.
6. If a stop criterion is achieved, stop the procedure and output the results. Otherwise, return to step 2.

The PSO individuals are evaluated by the following Right fit function:

Right fit = 1 / (1 + MSE)

The Right fit function is used to evaluate the model's performance. In this study, better-adapted PSO individuals have higher Right fit values, approaching a value of 1.
The termination conditions for PSO are as follows:
- The maximum number of PSO iterations (100)
- The size of the population (swarm) reaches 60 particles
- Training process error below 10^-4
The stop criteria for the training algorithm are as follows:
- Maximum number of iterations (1000)
- Validation or generalization error greater than 5%
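The following is a minimal gbest-PSO sketch of equations (1) and (2) under the Right fit convention, assuming NumPy; the defaults loosely follow the setup of Table 2 in the next section and are illustrative rather than the authors' toolbox settings:

import numpy as np

def pso(fitness, dim, n_particles=9, iters=100,
        w=(0.9, 0.6), c1=2.1, c2=2.1, v_max=150.0,
        lo=-10.0, hi=10.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(lo, hi, (n_particles, dim))   # positions
    v = np.zeros((n_particles, dim))              # velocities
    pbest = x.copy()
    pbest_fit = np.array([fitness(p) for p in x])
    g = pbest[np.argmax(pbest_fit)].copy()        # global best
    for t in range(iters):
        # linearly decreasing inertia weight
        wt = w[0] + (w[1] - w[0]) * t / max(iters - 1, 1)
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        v = wt * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)  # eq (1)
        v = np.clip(v, -v_max, v_max)
        x = x + v                                               # eq (2)
        fit = np.array([fitness(p) for p in x])
        improved = fit > pbest_fit
        pbest[improved] = x[improved]
        pbest_fit[improved] = fit[improved]
        g = pbest[np.argmax(pbest_fit)].copy()
    return g, pbest_fit.max()

For the SO2 model of the next section, fitness would be a function returning 1 / (1 + MSE) of a candidate parameter vector against the training data, so higher Right fit is better.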


3.1 Nature of Data
The Particle Swarm Optimization (PSO) model was used to forecast the SO2 emission, and the related data were used for building the models. The values of four energy commodities, namely global oil, natural gas, coal and primary energy, are considered.

3.2 Evaluation Criterion
To assess the performance of the PSO model, the Root Mean Square error (RMS), the Manhattan distance (MD), the Euclidian distance (ED), and the Variance-Accounted-For (VAF) were measured. These performance criteria help in evaluating the capabilities of the developed PSO model. The RMS, VAF, ED and MD are computed as follows:

(1) Root Mean Square (RMS):
RMS = (1/n) * sum_{i=1..n} |yi - y^i|   (3)

(2) Variance-Accounted-For (VAF):
VAF = [1 - var(y - y^) / var(y)] x 100%   (4)

(3) Euclidian distance (ED):
ED = sqrt( sum_{i=1..n} (yi - y^i)^2 )   (5)

(4) Manhattan distance (MD):
MD = sum_{i=1..n} |yi - y^i|   (6)

where y and y^ are the observed and estimated SO2 of the proposed model, and n is the number of measurements used in the experiments.

3.3 Experiments and Results
The input and output data used for the model are presented in Table 1. The proposed model is an exponential model of the input variables; the tuning parameters and search space for PSO are given in Table 2. A Matlab Toolbox for PSO was used to develop our results [4]. Candidate solutions (particles) in this case are simply n-dimensional vectors of the form:

[α1 β1 α2 β2 α3 β3 α4 β4 γ]

The search process begins with a random distribution of the initial PSO population, which represents a random sample of the PSO search space. The corresponding Right fit of each particle is computed, and particles with higher Right fit are selected to produce a new population which enhances the features of their parents. The problem under consideration is to estimate the correct parameters α1, β1, α2, β2, α3, β3, α4, β4 and γ for the proposed exponential model. A data set adopted from [5] was used for the model development process. We defined the parameter space, and the other PSO setup tuning parameters are given in Table 2. The Right fit (i.e. quality) of a particular estimate is obtained by observing the behavior of the system with the estimated parameters and using the RMS error between the actual and predicted SO2.

SO2 = α1 Oil^β1 + α2 NG^β2 + α3 Coal^β3 + α4 PE^β4 + γ   (7)

Table 1. Inputs and Output for the Exponential PSO model

  Inputs    Oil; NG; Coal; PE
  Output    SO2

Table 2. The tuning parameters for the PSO

  Operator                     Value
  Acceleration constant        [2.1, 2.1]
  Inertia Weight               [0.9, 0.6]
  Maximum no. of Iterations    1000
  Maximum Velocity             150
  No. of Particles             9
  Search space for α and β     [-10, 10]
  Search space for γ           [-1000, 1000]

The model parameters computed using PSO were found as given in Equation 8:

SO2 = 8.0639 Oil^-4.2278 + 5.9998 NG^0.63625 + 4.4493 Coal^-0.14562 + 2.4769 PE^1.1308 + 3.7837   (8)

Previous research work presented in [5] explored the use of Genetic Algorithms to find the best tuning parameters for the same exponential model. The developed model is presented in Equation 9:

SO2 = 0.3427 Oil^2.0509 + 0.9188 NG^0.9115 + 0.0092 Coal^0.9096 + 0.25644 PE^0.2827 + 0.3597   (9)
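The evaluation measures of equations (3)-(6) and the exponential model of equation (7) can be sketched as follows (assuming NumPy; this is an illustration, not the Matlab toolbox used in the study):

import numpy as np

def so2_model(params, oil, ng, coal, pe):
    # Exponential model of equation (7): a1*Oil^b1 + ... + gamma.
    a1, b1, a2, b2, a3, b3, a4, b4, g = params
    return a1 * oil**b1 + a2 * ng**b2 + a3 * coal**b3 + a4 * pe**b4 + g

def rms(y, y_hat):   # equation (3), as printed: mean absolute deviation
    return np.mean(np.abs(y - y_hat))

def vaf(y, y_hat):   # equation (4)
    return (1 - np.var(y - y_hat) / np.var(y)) * 100.0

def ed(y, y_hat):    # equation (5)
    return np.sqrt(np.sum((y - y_hat) ** 2))

def md(y, y_hat):    # equation (6)
    return np.sum(np.abs(y - y_hat))

With params set to the PSO estimates of equation (8), namely (8.0639, -4.2278, 5.9998, 0.63625, 4.4493, -0.14562, 2.4769, 1.1308, 3.7837), these metrics compare the predicted and observed SO2 series.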


4. CONCLUSION
Air pollution is a significant risk factor for a number of health conditions, including respiratory infections, heart problems, stroke and lung problems [6]. The health effects caused by air pollution may include difficulty in breathing, wheezing, coughing, and asthma. These effects can result in increased medication use, increased doctor or emergency room visits, more hospital admissions and premature death. The most common air pollutants include particulates, sulfur dioxide, ozone, and nitrogen dioxide. For several years, PSO has been successfully applied in many research and application areas. It has been demonstrated that PSO gets better results in a faster, cheaper way compared with other methods. Another reason that PSO is attractive is that there are few parameters to adjust: one version, with slight variations, works well in a wide variety of applications. Particle swarm optimization has been used both for approaches applicable across a wide range of applications and for specific applications focused on a particular requirement. In this work we have given an exponential model for estimating the global sulphur dioxide emission based on some attributes. Future work will concentrate on exploiting computing techniques to develop a mathematical model for analyzing the global sulphur emission.

5. REFERENCES
[1] Shou Hsing Shih and Chris P. Tsokos. Prediction models for carbon dioxide emissions and the atmosphere.
[2] "7 million premature deaths annually linked to air pollution". WHO. 25 March 2014.
[3] Eberhart, R.; Kennedy, J.; Proc. of Int. Conference on Neural Networks, Perth, Australia, 1995.
[4] B. Birge. PSOt - a particle swarm optimization toolbox for use with Matlab. In Proceedings of the 2003 IEEE Swarm Intelligence Symposium (SIS03), pages 182-186, 2003.
[5] H. Kavoosi, M.H. Saidi, M. Kavoosi, and M. Bohrng. Forecast global carbon dioxide emission by use of genetic algorithm. International Journal of Computer Science Issues (IJCSI), 9(1), 2012.
[6] Maximilian Auffhammer and Richard T. Carson. Forecasting the path of China's CO2 emissions using province-level information. Journal of Environmental Economics and Management, 55(3):229-247, May 2008.
[7] Maximilian Auffhammer and Ralf Steinhauser. Forecasting the path of U.S. CO2 emissions using state-level information. The Review of Economics and Statistics, 94(1):172-185, February 2012.
[8] Andries P. Engelbrecht. Computational Intelligence: An Introduction. Wiley Publishing, 2nd edition, 2007.
[9] Bert Metz. Climate Change 2007 - Mitigation of Climate Change: Working Group III Contribution to the Fourth Assessment Report of the IPCC, volume 4. Cambridge University Press, 2007.
[10] Reza Samsami. Application of ant colony optimization (ACO) to forecast CO2 emission in Iran. International Journal of Environment, Pharmacology and Life Sciences, Bulletin of Environment, Pharmacology and Life Sciences, 2:95-99, 2013.
[11] A. Vinayagam, Dr. C. Kavitha, Dr. K. Thangadurai. Multi Model Air Pollution Estimation for Environmental Planning Using Data Mining. International Journal of Science and Research, ISSN (Online): 2319-7064, Impact Factor (2012): 3.358, pages 332-335.


An Efficient Analysis on Semantic Approaches in various Fields

Dr. P. Raajan
Assistant professor of MCA
Ayya Nadar Janaki Ammal College, Sivakasi
Tamil Nadu, India
raajanp99@gmail.com

K. Aruna Devi
Assistant professor of MCA
Ayya Nadar Janaki Ammal College, Sivakasi
Tamil Nadu, India
aruna23janu@gmail.com

Lydia Packiam Mettilda
Researcher
Madurai, Tamil Nadu, India
lydiajora18@gmail.com

ABSTRACT

In the present scenario, information technology plays a dominant role in every part of human life. The rapid growth in information collection has driven the development of various methodologies, and new techniques have emerged to manage such data processing efficiently. In this situation, the semantic approach, grounded in computational theory, supports decision making and logical thinking, and penetrates various fields such as biometrics, networks and data mining. In this paper we present a detailed analysis of the role of semantics in these fields and provide a clear road map that may help researchers and scientists to work with semantic approaches such as the semantic web, semantic networks and other semantic-web-based technologies.

Keywords:
RDF, HIS, URL, Bioinformatics, Semantics

1. INTRODUCTION

Since information processing is a major task worldwide, new advanced logical methods are rapidly being developed to access information efficiently. One such logical method is semantics, which is used in fields such as image processing, data mining, network security and bioinformatics. Data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into meaningful information.

Image processing is a method to convert an image into digital form and perform operations on it in order to obtain an enhanced image or to extract useful information from it. It is a type of signal processing in which the input is an image, such as a video frame or photograph, and the output may be an image or characteristics associated with that image. An image processing system usually treats images as two-dimensional signals and applies established signal processing methods to them. A network consists of two or more computers that are linked in order to share resources, exchange files, or allow electronic communications. The computers on a network may be linked through cables, telephone lines, radio waves, satellites, or infrared light beams. Two very common types of networks are:

Local Area Network (LAN)
Wide Area Network (WAN)

2. REVIEW OF THE LITERATURE

Joel Lang [1] developed a probabilistic framework for unsupervised role induction, which treats the role induction problem as the deduction of alternative linkings and their canonical syntactic form; classifiers with latent variables are trained on parsed text.

Ivan Herman [6] described methods for storing and querying RDF: a simple way to store and publish RDF files, retrieval of RDF files, and a review of RDF over relational data.

Chris Bizer [7] used linked data, a method of publishing data on the semantic web, to enable reusability, reduce redundancy and enable network effects.

Donovan Artz [8] identified four broad categories of trust as components of the semantic web and related each area to important aspects of ongoing and future semantic web work.

Robert Stevens [9] analyzed the semantic web, especially for the life sciences. The work focused on the large quantities of annotated biological data and knowledge available on the web, and offers a solution to long-standing issues such as semantic heterogeneity.

Susie Stephens [10] developed a drug discovery model with the Oracle RDF data model and a semantic navigator, in which biomarker genes were considered for differentiating patients with large B-cell lymphoma.

Daniel A. Reid [11] described imputing human descriptions in semantic biometrics to relate how people and biometric systems represent subjects. This imputation technique is used to increase the accuracy and robustness of the automatic semantic annotation of gait signatures.

3. METHODS & CONCEPT ANALYSIS

3.1. Unsupervised induction of semantics

This is a semantics-oriented approach formalized as a clustering problem in which the arguments are grouped into clusters, and the constraints on a cluster relate it to a semantic role. A key concept is the standard linking, a deterministic mapping from semantic roles to syntactic functions; predicates that exhibit a standard linking are used in the specific mapping.
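To make the clustering formulation concrete, the following toy sketch (ours, not Lang's actual model) encodes a few hypothetical verb arguments by simple syntactic features and clusters them, reading each cluster as one induced semantic role; the feature names and argument instances are invented for illustration.

# A minimal sketch of clustering-based role induction: arguments described
# by toy syntactic features are clustered, and each cluster is read as one
# induced semantic role.
from sklearn.cluster import KMeans
from sklearn.feature_extraction import DictVectorizer

# Hypothetical argument instances of the verb "give".
arguments = [
    {"function": "subject", "position": "before-verb", "pos": "NNP"},
    {"function": "object", "position": "after-verb", "pos": "NN"},
    {"function": "iobject", "position": "after-verb", "pos": "NNP"},
    {"function": "subject", "position": "before-verb", "pos": "PRP"},
    {"function": "object", "position": "after-verb", "pos": "NNS"},
]

X = DictVectorizer(sparse=False).fit_transform(arguments)
roles = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
for arg, role in zip(arguments, roles):
    print(arg["function"], "-> induced role", role)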
3.2. Role of trust in the Semantic Web vision

Trust is a principal component of the semantic web and appears as a layer in the layered semantic web stack, where a trust layer sits above the ontology layer. Trust also provides a mechanism to assess the information source, and signature and cryptography techniques are used for security purposes. Trust research can be organized into four major areas:

Policy-based trust: Trust is established using various policies that focus on managing and exchanging credentials and enforcing access policies. It is generally established once a sufficient amount of credentials pertaining to a specific party has been obtained, after which certain authentication rights are granted to that party. The various issues here are solved by using a third party to serve as an authority for issuing and verifying credentials.

Reputation-based trust: Trust is established from the earlier interactions and performance of an entity, which provide an estimate of its future behavior. It uses the history of an entity's actions/behavior to compute trust, and may use referral-based trust in the absence of first-hand knowledge. Recommendations are trust decisions made by other users, and combining these decisions to synthesize a new, often personalized, one is another commonly addressed problem.

General models of trust: These models are useful for analyzing human and agent trust decisions and for operationalizing computable models of trust. They also describe the values or factors that play a role in computing trust, and lean on work in psychology and sociology for a decomposition of what trust comprises. They range from simple access control policies to analyses of competence, beliefs, risk, importance, utility, etc.

Trust in information resources: Trust in information resources on the web has its own range of varying uses and meanings, including capturing ratings from users about the quality of information and services. Web site design influences trust in content and content providers, and trust can be propagated over links. New work in trust harnesses the potential gained from machine understanding while addressing the problems of relying on the content available on the web. This information is key to supporting automated decision making.

Trust concerns on the Semantic Web: Bizer and Oldakowski [5] make several claims with the semantic web in mind. First, any statements contained in the semantic web must be considered claims rather than facts until trust can be established. Next, their work makes the case that it is too much of a burden to provide full trust information at present. Finally, context-based trust matters; here, context refers to the circumstances and associations of the target of the trust decision, and information can then be filtered based on trust.
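As a toy illustration of the reputation-based category above (a sketch of ours, not a model from the surveyed papers), trust can be computed from an entity's interaction history, falling back to referrals from other users when no first-hand knowledge exists:

# Toy reputation-based trust: average of past outcomes, with referral-based
# trust as a fallback when no first-hand history is available.
def reputation_trust(history, referrals=None):
    """history: outcomes of past interactions in [0, 1];
    referrals: trust decisions reported by other users."""
    if history:
        return sum(history) / len(history)
    if referrals:
        return sum(referrals) / len(referrals)
    return 0.5  # neutral prior when nothing is known

print(reputation_trust([1.0, 0.8, 0.9]))   # first-hand history
print(reputation_trust([], [0.6, 0.7]))    # referral-based fallback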
3.3 Semantic Web in Life Science

The semantic web plays a vital role across a broad category of goals of bio-informaticians. In this area, huge quantities of data, together with their annotations, are present on the web; they are highly distributed and heterogeneous, and the heterogeneity exists at many levels. Semantic web technologies and the semantic web vision provide a solution by integrating the views of bioinformatics. At one level of heterogeneity, programmatic access to bioinformatics resources is offered as web services; with semantic web technologies, the common access is to distributed resources on heterogeneous platforms.

The main objective of the semantic web is to make facts amenable to machine processing. The Resource Description Framework (RDF) provides a common data model of triples for this processing. An RDF triple consists of a subject, a predicate (verb) and an object, which enables any statement to be represented in a simple and flexible common framework. A triple names a resource using either a Uniform Resource Identifier (URI) or a literal. To transform resources into this data model, a common naming scheme is used, forming a vast graph of descriptions of resources [4]. The Life Science Identifier (LSID) is a form of URI that can be used to uniquely identify and version bioinformatics entries.
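The triple model is easy to demonstrate with the rdflib Python library; the vocabulary, the LSID-style identifier and the gene facts below are invented for illustration:

# Subject-predicate-object triples in rdflib; all names here are made up.
from rdflib import Graph, Literal, Namespace, URIRef

BIO = Namespace("http://example.org/bio/")  # hypothetical vocabulary
g = Graph()

gene = URIRef("urn:lsid:example.org:genes:BRCA1")  # an LSID is a form of URI
g.add((gene, BIO.encodes, Literal("breast cancer type 1 protein")))
g.add((gene, BIO.locatedOn, URIRef("http://example.org/bio/chromosome17")))

# Every statement is one (subject, predicate, object) triple in the graph.
for s, p, o in g:
    print(s, p, o)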
3.4 Semantic Web in bioinformatics

Due to the increase in the volume of bioinformatics-related data available on the web, data access has become a tedious task for researchers and scientists; because of the heterogeneous computing environment, users must swap between various websites to access this information. The Oracle RDF Data Model has generated much attention. This stems from the desire to manage RDF data in a secure, scalable, and highly available environment. Users have the flexibility of incorporating multiple media data, such as images and text, into the RDF graphs. Additional benefits include simplified integration of RDF data with other enterprise data, re-use of RDF objects, elimination of the modeling impedance mismatch between client RDF objects and relational storage, and easier maintenance of RDF applications.

The deployed RDF infrastructure was able to easily and quickly retrieve biological data related to the probe set biomarkers. It also became possible for users to select or search for the bioinformatics information of interest, rather than manually collecting identifiers to enable jumping between data sources. Faceted browsing methods were used to remove all data that do not meet the filter conditions, to select multiple entities for simultaneous search, and to identify clusterings among genes.

3.5 Semantic Web in Biometrics

Biometrics plays a predominant role in all fields of human identification systems (HIS). Humans also exhibit commonly co-occurring semantic labels. This label structure arises from genetic, morphological and social factors and can be exploited to improve semantic biometrics. The method considers that a person's physical attributes may be concealed from the camera for various sets of features; these effects can disturb the automatic semantic annotation of the biometric data, leading to incorrect semantic labels. By utilizing the semantic structure, we can compensate for missing visual features and correct erroneous semantic labels. The semantic structure can also improve semantic annotation under ideal conditions by increasing the accuracy of the latent semantic analysis technique used to annotate the biometric signatures.

Biometric Signatures

The semantic labels are grouped over a set of visual features; such visual features can be thought of as a visual signature of the subject, encompassing the person's physical attributes. The latent semantic analysis method used in the identification of visual features results in the person being labeled with a certain semantic description. The structure between the visual and semantic features can be used to automatically label people based on their physical characteristics. The visual signature encompasses the physical features that contribute to a semantic label being assigned to the subject; in the absence of these features, the method is unable to connect a semantic term to a physical trait. Because identification can fail in this way, semantic biometrics has been extended to surveillance cameras recording people in unconstrained environments. The generation of the visual signature is affected by lighting, resolution and frame-rate limitations, and biometric signatures obtained from different viewpoints will contain different views of the subject's physical appearance, which may be beneficial or detrimental to certain semantic labels. In this paper, the semantic structure is used to counter this problem.
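A hedged sketch of the latent-semantic-analysis step described above: a small, invented subject-by-label annotation matrix is factorized with truncated SVD, and the low-rank reconstruction is used to smooth or impute uncertain labels. This is a generic LSA illustration, not Reid and Nixon's exact procedure:

# LSA-style imputation over an invented subject x semantic-label matrix;
# 0.5 marks an unknown annotation to be estimated from the latent structure.
import numpy as np
from sklearn.decomposition import TruncatedSVD

M = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.5, 0.0, 0.0],   # second label missing for this subject
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 0.5],   # fourth label missing for this subject
])

svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(M)                  # latent semantic space
reconstructed = svd.inverse_transform(reduced)  # smoothed annotations
print(np.round(reconstructed, 2))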
3.6 Analyses of semantic networks
These analyses are based on various sources of semantic knowledge, such as association norms and association networks. The processes that generated such data differ in important ways, yet the resulting semantic networks are similar in the statistics of their large-scale organization.

3.6.1. Associative network
In the directed network, two word nodes x and y are joined by an arc if the cue x evoked y as an associative response for at least two of the participants in the database. In the undirected network, word nodes are joined by an edge if the words are associatively related, regardless of the direction of association. The directed network is clearly the more natural representation of word associations, but the other networks studied are undirected, and most of the literature on small-world and scale-free networks has focused on undirected networks; the undirected network of word associations therefore provides an important benchmark for comparison.
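The construction of both networks can be sketched in a few lines with the networkx library; the cue-response counts below are invented stand-ins for the association-norm database:

# Build the directed association network from (cue, response, count) norms,
# then derive the undirected version used as a benchmark.
import networkx as nx

norms = [("bread", "butter", 12), ("butter", "bread", 9),
         ("dog", "cat", 25), ("cat", "dog", 18), ("dog", "bone", 1)]

directed = nx.DiGraph()
for cue, response, count in norms:
    if count >= 2:                  # arc only if at least two participants
        directed.add_edge(cue, response)

# Undirected version: an edge whenever the words are associated either way.
undirected = directed.to_undirected()
print(sorted(directed.edges()))
print(sorted(undirected.edges()))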
3.6.2. WordNet
WordNet was developed by George Miller and colleagues. The network contains 120,000+ word forms and 99,000+ word meanings. The basic links in the network are between word forms and word meanings. Multiple word forms are connected to a single word-meaning node if the word forms are synonymous, and a word form is connected to multiple word-meaning nodes if it is polysemous. Word forms can also be connected to each other through a variety of relations such as antonymy (e.g., BLACK and WHITE). Word-meaning nodes are connected by relations such as hypernymy (a MAPLE is a TREE) and meronymy (a BIRD has a BEAK). Although these relations are directed, they can be read in both directions depending on which relation is stressed.
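These relations can be inspected directly with NLTK's WordNet interface (the corpus must be downloaded once with nltk.download('wordnet')); note that synsets play the role of the word-meaning nodes described above:

# Word-meaning nodes (synsets), their word forms, and hypernym links.
from nltk.corpus import wordnet as wn

for synset in wn.synsets("maple"):
    print(synset.name(), synset.lemma_names(), synset.hypernyms())

# Antonymy links word forms (lemmas) rather than meanings, e.g. BLACK/WHITE:
black = wn.synsets("black", pos=wn.ADJ)[0].lemmas()[0]
print(black.antonyms())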
4. CONCLUSION

In this paper, we presented an analysis of the semantic approach, which plays a predominant role in various areas, especially web-based ones. In this analysis we covered the basis of the semantic web, its rules and its working principles, and we analyzed various semantic-web-based applications in the areas of classification, the web, networks, biometrics and bioinformatics. This paper thus provides a significant ideology for researchers working in the specified areas based on semantics. In future work, we plan to analyze the role of semantics in various applications on different datasets and to propose a common framework for semantic analysis.

5. REFERENCES
[1] Lang, Joel, and Mirella Lapata. "Unsupervised induction of semantic roles." Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2010.
[2] Liu, Haishan. "Towards semantic data mining." 9th International Semantic Web Conference (ISWC 2010). 2010.
[3] Steyvers, Mark, and Joshua B. Tenenbaum. "The large-scale structure of semantic networks: Statistical analyses and a model of semantic growth." Cognitive Science 29.1 (2005): 41-78.
[4] Wang, X., R. Gorlitsky, et al. "From XML to RDF: how semantic web technologies will change the design of 'omic' standards." Nature Biotechnology 23(9) (2005): 1099-1103.
[5] C. Bizer and R. Oldakowski. "Using context- and content-based trust policies on the semantic web." WWW Alt. '04: Proceedings of the 13th International World Wide Web Conference on Alternate Track Papers & Posters, ACM Press, New York, NY, USA, 2004, pp. 228-229.
[6] Ivan Herman. "Introduction to Semantic Web Technologies." www.w3.org/2010/Talks/0622-SemTech-IH/
[7] Chris Bizer, Richard Cyganiak, and Tom Heath. "Linked data principles: linking open data community project." http://linkeddata.org/guides-data-tutorials
[8] Artz, Donovan, and Yolanda Gil. "A survey of trust in computer science and the semantic web." Web Semantics: Science, Services and Agents on the World Wide Web 5(2): 58-71.
[9] Robert Stevens, Olivier Bodenreider, and Yves A. Lussier. "Semantic web for life science." http://www.w3.org/2004/01/sn/s-ns-html
[10] Susie Stephens, David LaVigna, Mike DiLascio, and Joanne Luciano. "Aggregation of bioinformatics data using semantic web technology."
[11] Reid, Daniel A., and Mark S. Nixon. "Imputing human descriptions in semantic biometrics." Proceedings of the 2nd ACM Workshop on Multimedia in Forensics, Security and Intelligence, pp. 25-30. ACM.

An Efficient Classification using Machine Learning Approach on Breast Cancer Datasets

D. Rajakumari
Assistant Professor, Dept. of Computer Science
Nandha Arts & Science College, Erode, India
rsrajakumarid@gmail.com

ABSTRACT

In large databases there may exist critical nuggets: small collections of records or instances that contain domain-specific important information. This paper considers the notion of identifying subsets of critical data instances in data sets. Critical nuggets of information can take the following form during classification tasks: small subsets of data instances that lie very close to the class boundary and are sensitive to small changes in attribute values, such that these small changes result in the switching of classes. Such critical nuggets have an intrinsic worth that far outweighs other subsets of the same data set. In classification tasks, our KM-SVM considers a data set that conforms to a certain representation or classification model. If one were to perturb a few data instances by making small changes to some of their attribute values, the original classification model representing the data set would change. Likewise, if one were to remove those data instances, the original model could change significantly.

Keywords
Critical Nuggets, Classification, Data Mining

1. INTRODUCTION
Data mining is the process of extracting hidden knowledge from large volumes of raw data. The knowledge must be new, not obvious, and usable. Data mining has been defined as the nontrivial extraction of previously unknown, implicit and potentially needful information from data. It is the science of extracting important information from high-dimensional databases and is one of the tasks in the process of knowledge discovery from databases. Data mining is used to discover knowledge from data and present it in a form that is easily understood by humans. It is a process for examining large amounts of routinely collected data, and it is most useful in exploratory analysis because of the nontrivial information hidden in large volumes of data. It is a cooperative effort of humans and computers: the best results are achieved by balancing the knowledge of human experts in describing problems and goals with the search capabilities of computers. The two primary goals of data mining tend to be prediction and description. Prediction uses some variables or fields in the data set to predict unknown or future values of other variables of interest, while description focuses on finding patterns describing the data that can be interpreted by humans [1]. Disease prediction plays an important role in data mining. There are different types of diseases predicted in data mining, namely hepatitis, lung cancer, liver disorder, breast cancer, thyroid disease, diabetes, etc. This paper analyses predictions for breast cancer disease patients [1].

2. BREAST CANCER DISEASE PREDICTION PROBLEM FORMULATION

Generally, the actual refinement is divided into two phases, called the match and refine phases. In NMS (Natural Medical Data Processing), such a natural evolution process of nuggets can also be controlled by users. NMS allows users to cease, quicken, or slow down the evolution by setting different parameters, such as initial vitality, fading rate, and increasing rate. Besides such macro control, users can also directly manipulate any individual nugget. For example, users can mark a nugget as crucial, indicating that it should never expire from the system; they can also directly delete useless nuggets [2].

Match phase: In this phase, we aim to match the identified nuggets with patterns around them within the data space. In other words, our goal is to determine which patterns users were searching for when these specific nuggets were made. Briefly, the concept of a match is used to judge whether some data patterns, or the major parts of these patterns, primarily contribute to a nugget; if so, we call the nugget and these patterns matched. A good example is a match between a nugget and a cluster pattern in the dataset. The specific techniques used to calculate how much a nugget matches the patterns around it are described in our project [4].

Refinement phase: The match phase reveals what type of patterns a user was searching for. With this knowledge, we can finish nugget refinement using the two steps of splitting (if necessary) and modification. These two steps make each nugget a perfect representative of a single pattern and make it more suitable for exploratory knowledge discovery. The results obtained from a decision tree are easier to interpret. KM-SVM is a statistical classifier that assumes no dependency between attributes; to determine the class, the posterior probability should be maximized. One advantage is that one can work with the KM-SVM model without using any Bayesian methods [3], and here the KM-SVM classifier performs well. The k-nearest neighbors algorithm (k-NN) is one of the

important methods for classifying objects based on the closest training data in the feature space. It is the simplest among all machine learning algorithms, but the accuracy of the k-NN algorithm can be degraded by the presence of noisy features. This observation was made using a training set of 3000 instances with 14 different attributes; the dataset is divided into training and testing parts, i.e., 70% of the data are used for training and 30% for testing. The authors concluded that the KM-SVM algorithm performs well when compared to other algorithms.

(i) In the medical dataset, both the categorical and numeric attributes are transformed into a transaction dataset.

(ii) To find predictive association rules with medically relevant attributes, the search process should incorporate the four constraints mentioned above.

(iii) To validate the association rules, the train-and-test approach should be used.

A genetic algorithm has been used to reduce the actual data size and obtain the optimal subset of attributes sufficient for breast cancer disease prediction. Classification is one of the supervised learning methods used to extract models describing important classes of data. Three classifiers, i.e., decision tree, KM-SVM and classification via clustering, have been used to diagnose the presence of breast cancer disease in patients.

Classification via clustering: Clustering is the process of grouping similar elements. This technique may be used as a pre-processing step before feeding the data to the classifying model. The attribute values need to be normalized before clustering to avoid high-value attributes dominating the low-value attributes; classification is then performed based on the clustering [5].
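The k-NN experiment described above (3000 instances, 14 attributes, a 70/30 split) can be sketched with scikit-learn; the synthetic data below merely stands in for the medical dataset, and k = 5 is an assumed value:

# k-NN baseline with a 70/30 train/test split on synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=3000, n_features=14, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)   # 70% train, 30% test

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("k-NN test accuracy:", knn.score(X_test, y_test))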
3. PROBLEM STATEMENT
The system identifies critical nuggets to measure the criticality of data instances. Critical nuggets are identified using a CR score, which helps in improving classification accuracies in real-world data sets. The CR score values are stored in a histogram and the highest score values are identified; using the highest CR score, the critical nugget is identified in the data sets, which improves the accuracy of classification. Class imbalance problems have drawn growing interest recently because of the classification difficulty caused by imbalanced class distributions. The data points in the network are represented as Si. In a neighborhood, represented as N(Si), k boundary data points are connected to a central point with an inter-point distance of arc units. For KM-SVM based on Weka, the inter-point distance is r*c; however, k data points per attribute are grouped together at each point. Hence, no communication is required for event detection based on temporal and attribute correlations. In case an anomaly is detected, the data points transmit with a higher power in order to increase their transmission radius and communicate with each other.

3.1 K-Means Support Vector Machine Learning Algorithm (KM-SVM)
For KM-SVM, the central point Si is considered to be within the malignant range rc of all other data points in N(Si). The k spatially neighboring data points of Si are represented by Sij, such that {j = 1, 2, ..., k}. At each time interval m, each data point Si in the set N(Si) measures a d-dimensional data vector x~m; thus each data point, equipped with d different types of sensing, records a d-point data vector. Let x~m(0), x~m(1), x~m(2), ..., x~m(k) denote the d-point data vectors at Si, Si1, Si2, ..., Sik in the set N(Si) at the m-th time instant, respectively. The goal is to identify every new measurement arriving at Si as normal or anomalous in real time. KM-SVM determines the radius of the quarter-sphere based on attribute correlations in addition to the spatial-temporal correlations [6].

Fig 1: Region R for KM-SVM (the central point Si and its neighbours Si1-Si6 lie within the radius Rc)

An outlier is an observation that appears to be inconsistent with the remainder of a set of data; this is the main problem that outlier detection methods try to solve. The problem has been approached using statistical and probabilistic knowledge, distance and similarity-dissimilarity functions, metrics and kernels, accuracy when dealing with labelled data, properties of patterns, and other domain-specific features; therefore, every method using a dataset faces unusual data. A model is developed for defining outlying patterns and for defining efficient algorithms for mining outlying patterns, which needs very limited domain knowledge and no class information. A prototype of a dynamic subspace search system is proposed in [5], which utilizes a sample-based learning process to effectively identify the outlying subspaces of a given point. In [6], the authors developed a technique for efficiently computing the lower and upper bounds of the distance between a given point and its kth nearest neighbour in each possible subspace, to speed up the fitness evaluation of their genetic algorithm for outlying subspace detection. There is a lot of work on outlier detection, including statistical works [7], which use properties of the data distribution, as well as probabilistic [8] and Bayesian [9] techniques attempting to find a model of the anomalies; however, these approaches are oriented to univariate data or multivariate samples with only a few dimensions, and processing time is also a problem when a probabilistic method is used. Distance-based techniques [1, 10, 11] have been proposed; these methods use distance and dissimilarity functions between attributes [10], instances [11] or series of objects [1] in order to detect deviations in data sets. These methods perform well, but they often require parameters that are difficult to set. Other approaches use kernel properties to compute similarity between objects [12]; clustering algorithms can also implicitly deal with outliers [23] or de-noise a data set [15]. Principal component analysis has been used to improve detection [2, 13] by using a combination of the first few PCs and the last few PCs. Techniques based on residuals for

regression estimation have been proposed [2], but accuracy is not the best measure for detection. Here we compare our research to some relevant literature related to the presented work. The relevant papers can be grouped into categories according to the specific task they are intended to solve, such as outlier detection [14], outlying property discovery (OPD), subgroup discovery, emerging patterns (EPs), contrast sets, rule-based classifiers and association rule mining.
3.2 Existing System
In the existing system, the detection of patterns and outliers has emerged as an important area of work in the field of data mining. Outliers are abnormal objects that deviate from other, normal objects. Finding the outliers in several applications, such as credit card fraud detection and identifying network intrusions, helps in identifying anomalies. The system provides the idea of finding outliers based on their difference in object distance. In the classification task, detecting the outliers in the data set yields results with less accuracy. The mining of outliers makes it possible to identify records that are different from the rest of the data set. However, the existing work does not focus much on finding critical nuggets of information in the data sets: the nuggets of information are not detected by outlier detection.

Disadvantages of the existing system:
Less classification accuracy.
Does not overcome the class imbalance problem.
3.3 Proposed Methodology

In the proposed system, KM-SVM (K-Means SVM) based critical nuggets are small collections of records or instances that contain important information. The system uses critical nuggets to measure the criticality of data instances: critical nuggets are identified using the KM-SVM calculation, and important instances are identified using the score value. This also helps in improving classification accuracies in real-world data sets.

Class imbalance problems have drawn growing interest recently because of the classification difficulty caused by imbalanced class distributions. A preprocessing step is applied in order to balance the class distribution for the imbalanced dataset problem, so the accuracy of the critical nugget detection can be improved; the multi-class problem is also overcome in the proposed system. The KM-SVM algorithm reduces support vectors by combining the critical-nuggets rotation-test clustering technique and SVM. Since the K-means clustering technique can almost preserve the underlying structure and distribution of the original data, the testing accuracy of KM-SVM classifiers can be kept under control to some degree, even though reducing support vectors could incur a degradation of testing accuracy. This paper combines the K-means clustering technique with SVM to build classifiers, and the proposed algorithm is called KMSVM. The KMSVM algorithm can build classifiers with many fewer support vectors and higher response speed than SVM classifiers; moreover, the testing accuracy of KMSVM classifiers can be guaranteed to some extent. The details of the KMSVM algorithm are described below.

Step 1: Three input parameters are selected: the kernel parameter γ, the penalty factor C, and the compression rate Pscore (Precision Score).
Step 2: The K-means clustering algorithm is run on the original data, and all cluster centers are regarded as the compressed data for building classifiers.
Step 3: SVM classifiers are built on the compressed data.
Step 4: The three input parameters are adjusted by the heuristic searching strategy proposed in this paper, according to a tradeoff between the testing accuracy and the response time.
Step 5: Return to Step 1 to test the new combination of input parameters, and stop if the combination is acceptable according to testing accuracy and response time.
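A minimal sketch of Steps 1-5 (our reading of the KMSVM procedure, with illustrative parameter values rather than the paper's tuned ones): the training data are compressed with K-means, each center takes the majority label of its cluster, and an RBF SVM is trained on the centers. The heuristic search of Steps 4-5 would simply loop over candidate (γ, C, compression-rate) triples.

# KMSVM sketch: K-means compression followed by SVM training on centers.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=10, random_state=0)

compression_rate = 0.1                       # Step 1: gamma, C, compression
k = int(len(X) * compression_rate)
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)   # Step 2

centers = km.cluster_centers_                # compressed training set
center_labels = np.array([np.bincount(y[km.labels_ == i]).argmax()
                          for i in range(k)])  # majority label per cluster

svm = SVC(kernel="rbf", gamma=0.1, C=10.0).fit(centers, center_labels)  # Step 3
print("support vectors:", svm.n_support_.sum(), "of", k, "centers")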
Advantages of the proposed system:
The algorithm works in two phases, each phase working on one of the two classes. The steps of critical nugget identification are:
Identify the approximate boundary.
Consider the neighbourhood data around the boundary set.
Based on the KM-SVM value, identify the critical nuggets.
Improve the classification accuracy.
Overcome class imbalance by fine tuning.

Table 1. (Wisconsin Breast Cancer Database)

Compared to a supervised multi-label classifier over balanced collections, the unsupervised approach can achieve comparable, even better, results. Also, we do not limit our study to the corpus from the standard ACE evaluation, which is preselected, but also investigate performance on a regular newswire corpus. Experiments showed that bootstrapping on such a corpus performed better than on a regular corpus. Global inference based on the properties of such clusters was then applied to achieve further improvement. The pseudo co-testing method for event extraction depends on one view from the original problem of event extraction and another view from a coarser-granularity task.

Classification and critical nuggets: The study of observing data that differ most from other data is called outlier detection. These data are not used to perform classification, but they may alert the user by indicating the false rate of the process. Here we introduce a new concept called nuggets, which comes under the category of outliers: a nugget may be an outlier, but there is no assurance that a nugget falls under the outlier category. The small amount of data that is used to perform classification is called critical nuggets. The group of data to be removed from the relevant class and moved to some

other class is called the criticality. The classification model is taken as M1, and from that class model a subset of data (called N) is removed, forming a new class model M2 based on some conditions. The criticality can therefore be written as

M2 = M1 − N,

where M1 is the existing class model, N is the collection of data to be removed, and M2 is the new class model.

The idea is to find the attributes that are very sensitive when small changes are performed in a class: these attributes move from the relevant class to some other class when the changes are performed. The fraction of attributes moving from one class to another is called the criticality score value (KM-SVM). For example, if out of four directions the data move from one class to another in three directions while making small changes, the criticality score value is calculated as 3/4 = 0.75, so 75% of the data move from the existing class to the new class. In large data sets, the get-nugget-score algorithm is used to find the critical attributes that have a chance of moving from one class to some new class. This algorithm uses the rotation-method algorithm to find the criticality score value in large and high-dimensional databases, and critical nuggets are identified using the criticality score value. The class boundary is identified to differentiate one class from another. The revised get-nugget-score algorithm is used to list the score value of each attribute. A threshold value is identified using the class boundary; if the data lie near the threshold value, they have a chance of moving from one class to another. The revised get-nugget-score algorithm works in two phases: the first phase works on the relevant class and the second phase works on the moved class. The find-critical-nuggets algorithm is used to detect critical nuggets and improve the classification accuracy.
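The criticality-score idea (e.g., 3 class switches out of 4 perturbation directions giving 3/4 = 0.75) can be sketched as follows; this is a simplified stand-in for the get-nugget-score algorithm, with delta and the model chosen only for illustration:

# Fraction of +/-delta single-attribute perturbations that flip the
# predicted class of an instance: higher means closer to the boundary.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

def criticality_score(model, x, delta=0.1):
    base = model.predict(x.reshape(1, -1))[0]
    flips = trials = 0
    for j in range(len(x)):
        for sign in (+1.0, -1.0):
            x_pert = x.copy()
            x_pert[j] += sign * delta
            flips += int(model.predict(x_pert.reshape(1, -1))[0] != base)
            trials += 1
    return flips / trials

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = SVC(kernel="rbf").fit(X, y)
print(criticality_score(model, X[0]))   # e.g. 0.75 means 6 of 8 flips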

4. EXPERIMENTAL RESULTS FOR KM-SVM ACCURACY

Reporting performance based on predicting non-faulty units: Some papers have reported performance measures based on predicting the majority (non-faulty) class rather than the minority (faulty) class, and in some of these cases it is not made clear that the reported predictive performance is on the majority class. These issues can be very misleading when trying to evaluate predictive performance. Table 3 shows the very good accuracy, precision and recall performances reported for classification using some sample datasets ds1, ds2, ds3 and ds4. The process of recomputing the confusion matrix would not work on these figures if it were assumed that the values for precision and recall were based on the non-faulty class; Table 4 shows the workings for these recomputations. The workings suggest that the proposed models reported the performance of their systems based on predicting non-faulty units rather than faulty units. Since the vast majority of units in the data sets are non-faulty, ranging between 84.6% and 96.7% in their case, predicting the majority class is very easy, and high performance figures are therefore to be expected. Table 2 shows the results of this calculation and suggests that the performance of the studied system is much less positive: the F-measure ranges from 0.0 to 0.12, compared to the original maximum F-measure of 0.96.
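The recomputation argument can be checked with a small helper; the confusion-matrix counts below are invented (but internally consistent: the majority class's false negatives are the minority class's false positives and vice versa), and they show how treating the majority (non-faulty) class as positive inflates precision, recall and F-measure relative to the minority (faulty) class:

# Per-class precision/recall/F-measure from one shared confusion matrix.
def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical counts: 900 majority and 100 minority instances.
print("majority as positive:", prf(880, 80, 20))   # high scores
print("minority as positive:", prf(20, 20, 80))    # much lower scores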
Table 2. Performance measures for the faulty class using the values from Table.

Dataset  Accuracy  Precision  Recall
WPBC     94.6      1.2        0.332

Table 3. Accuracy, Precision and Recall

Algorithm         Dataset  F-Measure (Accuracy)  Precision (Positive Prediction)  Recall
CR_Rotation Test  WPBC     9.46                  1.2                              0.332
CR_Nugget Score   WPBC     9.56                  1.45                             0.45
SVM               WPBC     9.45                  1.32                             0.49
KM-SVM            WPBC     9.65                  1.45                             0.553

Table 4. Performance accuracy comparison for KM-SVM

Dataset  No. of Records  Accuracy  Precision  Recall
WPBC     700             94.6      1.2        0.332
WPBC     450             97.6      1.45       0.45

Fig 2: Accuracy chart for the proposed research (bar chart comparing CR_RotationTest, CR_NuggetScore, SVM and KMSVM on WPBC, with values 9.46, 9.56, 9.45 and 9.65 respectively)

5. CONCLUSION
Initially, a classification training file is formed and a new calculation, the critical nugget score, is introduced for measuring the criticality of a subset or nugget. For each column in the attributes, corresponding changes are made to find the critical data. Then, additional concepts are used to resolve conflicting scores when they occur; this critically helps in finding critical information during the classification task. Finally, the critical information is identified using the CR score value, the values are added to the histogram, and the reduced instances are used for training. The critical instances are used in the classification task for predicting the classification accuracy, which helps in improving the classification accuracy on the given datasets. This project was implemented for breast cancer diagnostic and prognostic

results from a merge between immune computing and feature reduction. Immune computing is one of the newest directions in bio-inspired machine learning and has had very fruitful successes in different areas. The classification selection theory is one of the first theories applied in KM-SVM, and in this paper it is supported with the feature-reduction technique Support Vector Machine as a first step before the start of the immune defense. The proposed KM-SVM achieved good, improved results on the breast cancer datasets.

6. REFERENCES
[1] A. Koufakou and M. Georgiopoulos, "A Fast Outlier Detection Strategy for Distributed High-Dimensional Data Sets with Mixed Attributes," Data Mining and Knowledge Discovery, vol. 20, no. 2, special issue SI, pp. 259-289, Mar. 2010.
[2] R.A. Weekly, R.K. Goodrich, and L.B. Cornman, "An Algorithm for Classification and Outlier Detection of Time-Series Data," J. Atmospheric and Oceanic Technology, vol. 27, no. 1, pp. 94-107, Jan. 2010.
[3] M. Ye, X. Li, and M.E. Orlowska, "Projected Outlier Detection in High-Dimensional Mixed-Attributes Data Set," Expert Systems with Applications, vol. 36, no. 3, pp. 7104-7113, Apr. 2009.
[4] K. McGarry, "A Survey of Interestingness Measures for Knowledge Discovery," Knowledge Eng. Rev., vol. 20, no. 1, pp. 39-61, 2005.
[5] L. Geng and H.J. Hamilton, "Interestingness Measures for Data Mining: A Survey," ACM Computing Surveys, vol. 38, article 9, http://doi.acm.org/10.1145/1132960.1132963, Sept. 2006.
[6] E. Triantaphyllou, Data Mining and Knowledge Discovery via Logic-Based Methods. Springer, 2010.
[7] E.M. Knorr, R.T. Ng, and V. Tucakov, "Distance-Based Outliers: Algorithms and Applications," VLDB J., vol. 8, no. 3/4, pp. 237-253, 2000.
[8] D. Hawkins, Identification of Outliers (Monographs on Statistics and Applied Probability). Springer, http://www.worldcat.org/isbn/041221900X, 1980.
[9] F. Angiulli and C. Pizzuti, "Outlier Mining in Large High-Dimensional Data Sets," IEEE Trans. Knowledge Data Eng., vol. 17, no. 2, pp. 203-215, Feb. 2005.
[10] Y. Tao, X. Xiao, and S. Zhou, "Mining Distance-Based Outliers from Large Databases in Any Metric Space," Proc. 12th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), T. Eliassi-Rad, L.H. Ungar, M. Craven, and D. Gunopulos, eds., pp. 394-403, 2006.
[11] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly Detection: A Survey," ACM Computing Surveys, vol. 41, no. 3, article 15, 2009.
[12] S.D. Bay and M. Schwabacher, "Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule," Proc. Ninth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), L. Getoor, T.E. Senator, P. Domingos, and C. Faloutsos, eds., pp. 29-38, 2003.
[13] A. Ghoting, S. Parthasarathy, and M.E. Otey, "Fast Mining of Distance-Based Outliers in High-Dimensional Datasets," Data Mining and Knowledge Discovery, vol. 16, no. 3, pp. 349-364, 2008.
[14] M.M. Breunig, H.-P. Kriegel, R.T. Ng, and J. Sander, "LOF: Identifying Density-Based Local Outliers," SIGMOD Record, vol. 29, no. 2, pp. 93-104, 2000.
[15] L. Duan, L. Xu, Y. Liu, and J. Lee, "Cluster-Based Outlier Detection," Annals of Operations Research, vol. 168, no. 1, pp. 151-168, http://dx.doi.org/10.1007/s10479-008-0371-9, Apr. 2009.
[16] N. Panda, E.Y. Chang, and G. Wu, "Concept Boundary Detection for Speeding Up SVMs," Proc. 23rd Int'l Conf. Machine Learning (ICML), W.W. Cohen and A. Moore, eds., vol. 148, pp. 681-688, 2006.
[17] P. Domingos, "MetaCost: A General Method for Making Classifiers Cost-Sensitive," Proc. Fifth Int'l Conf. Knowledge Discovery and Data Mining, pp. 155-164, 1999.
[18] A. Frank and A. Asuncion, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml, 2010.
[19] David Sathiaraj and Evangelos Triantaphyllou, "On Identifying Critical Nuggets of Information during Classification Tasks," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, June 2013.

Analysis on SNP Tools

Yogalakshmi. S, Research Scholar, Ayya Nadar Janaki Ammal College, Sivakasi. s.yogalakshmimsc@gmail.com
Venkatesh Kumar. R, Assistant Professor of MCA, Ayya Nadar Janaki Ammal College, Sivakasi. vengatesh.kumar@gmail.com
Lawrance. R, Director, MCA, Ayya Nadar Janaki Ammal College, Sivakasi. lawrancer@yahoo.com

ABSTRACT
With the innovative development of information technology, it has joined hands with various areas, especially bioinformatics. Due to this emergence, DNA plays a vital role in classifying human sequences. A Single Nucleotide Polymorphism (SNP) is the simplest form of DNA variation among individuals. All human beings differ from one another in physical appearance, susceptibility to disease and response to drugs; these differences are caused by single nucleotide changes in the human DNA sequence. The analysis of SNPs identifies the disease susceptibility in a gene using the gene mapping technique, and SNPs are mainly applied for personalized medication. In this paper, we describe the process of discovering SNPs and its processing tools.

Keywords
Single nucleotide polymorphism (SNP), Disease susceptibility, DNA, Personal medication, Gene Mapping.

1. INTRODUCTION
A Single Nucleotide Polymorphism (SNP) is a DNA sequence variation occurring when a single nucleotide A, T, C or G in the genome differs between members of a species. Generally, 99.9% of the human DNA sequence of A, G, C and T bases is identical, and the remaining 0.1% makes a person unique. Sequences are selected from the same region of the genome of two different people, and the two sequences are compared to find the variations, which may be SNPs or insertions/deletions (indels). An SNP and an indel are represented in the following figure.

X: A C G G T A G T A
Y: A C G A T A G T A A
         (SNP)        (INDEL)

Fig. 1 SNP Variation
In Figure 1, X and Y indicate the DNA sequences of two humans. The DNA sequences of X and Y are almost identical; only two nucleotide positions differ, and each such difference is called an SNP or an indel (insertion/deletion). The variations may be harmful or harmless: in the harmful case they cause some disease, and in the harmless case they change only the phenotype (appearance, characteristics). SNPs are the most abundant form of genetic variation and are the basis for most molecular markers. Many SNPs can occur in coding (gene) and non-coding regions of the genome [1]: a coding-region change can alter the amino acid sequence, while a non-coding change can alter the level of gene expression. SNPs occur throughout the human genome, about one in every 300 nucleotide base pairs; this translates to about 10 million SNPs within the 3-billion-nucleotide human genome [2], so DNA contains at least 10 million SNPs. SNPs can be discovered through shotgun sequencing and resequencing of targeted genomic regions [3]. A set of SNPs in the genome is called a haplotype block, and haplotype analysis is used for finding the locations of disease-affected genes. In this paper we present an overview of the methodology for mining SNPs from DNA sequence data and an analysis of SNP discovery software tools. Gene mapping is the process of establishing the locations of genes on the chromosomes: early gene maps used linkage analysis, and more recently scientists have used recombinant DNA (rDNA) techniques to establish the actual physical locations of genes on the chromosomes.
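As a small illustration of the comparison in Fig. 1, the sketch below (ours, not from any of the surveyed tools) scans two aligned sequences position by position and reports mismatches as candidate SNPs and a length difference as a candidate indel:

# Toy position-by-position comparison of two sequences as in Fig. 1;
# real tools work on alignments with quality scores, not raw strings.
def compare_sequences(x, y):
    variants = []
    for pos, (a, b) in enumerate(zip(x, y)):
        if a != b:
            variants.append((pos, a, b, "SNP"))
    if len(x) != len(y):             # trailing length difference
        variants.append((min(len(x), len(y)), "-", "-", "INDEL"))
    return variants

print(compare_sequences("ACGGTAGTA", "ACGATAGTAA"))
# [(3, 'G', 'A', 'SNP'), (9, '-', '-', 'INDEL')]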

2. SNP DISCOVERY PROCESS
SNPs are generally mined from DNA sequence data by the following steps:
1. Read the sequence data.
2. Group the sequence reads, using clustering or mapping.
3. Align the sequence reads.
4. Identify the sequence variants as potential polymorphisms.
The process is shown in Fig. 2.

Fig. 2 Process of analyzing SNPs: reference (known) sequence data are mapped (e.g., with BLAST), while de novo (unknown) sequence data are clustered (e.g., with d2_cluster); the reads are then aligned and assembled, sequence variants are identified, and SNPs are found.
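To make the four steps concrete, here is a hedged toy sketch of step 4: given reads already grouped and aligned to a reference (steps 1-3 are assumed done by tools such as BLAST or d2_cluster), each column of the alignment is inspected, and a position is reported as a candidate SNP when enough reads disagree with the reference. The min_support threshold is an assumption for illustration.

# Toy variant identification over aligned reads ('-' marks no coverage).
from collections import Counter

reference = "ACGTACGT"
reads = ["ACGAACGT",
         "ACGAACGT",
         "ACGTACGT"]

def candidate_snps(reference, reads, min_support=2):
    snps = []
    for pos, ref_base in enumerate(reference):
        bases = Counter(read[pos] for read in reads if read[pos] != "-")
        alt, support = bases.most_common(1)[0]
        if alt != ref_base and support >= min_support:
            snps.append((pos, ref_base, alt, support))
    return snps

print(candidate_snps(reference, reads))   # [(3, 'T', 'A', 2)]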

3. SEQUENCE DATA
The sequence data can be classified into two types:
1. Reference sequence data
2. De novo sequence data

3.1 Reference Sequence Data
Reference sequence data are known fragments, for which a sequence already exists against which new sequence data can be compared; examples are the human genome and many microbial species. In recent years the number of sequenced species has been rapidly increasing, so Next Generation Sequencing (NGS) technologies are used; these allow high-throughput (HT) sequencing and will likely bring many SNPs to the attention of researchers [4]. The NGS technologies include:

1. Illumina (Solexa) sequencing
2. Roche 454 sequencing
3. Ion Torrent: Proton sequencing
4. SOLiD sequencing

There are three kinds of reference sequence data:
All sequence data generated from fully sequenced species belong to the reference sequence data.
When sequencing has not reached a finished status, the data can still be used to map newly created sequence data.
Reference data may also be sequences generated from a known PCR product (amplicon). The PCR products can be used to visualize the large set of variation specific to the amplicon, resulting in ultra-deep sequencing.

After the reference sequence data have been obtained, the sequences are mapped using the BLAST tool. Mapping refers to the process of aligning short reads to a reference sequence, whether the reference is a complete genome or a de novo assembly. A tool is required to map the new sequence reads to the reference sequence data; for whole-genome data or a part of a specific genome, either a global alignment tool or a local alignment tool such as BLAST [5] or SSAHA [6] can be used. Optionally, repeated sequence regions can be ignored using RepeatMasker [7], which requires a list of known repeat sequences as input; ignoring repeated sequences is one of the preprocessing steps for DNA sequences. The mapped data give the aligned position of each new sequence read on the reference. Pairwise or multiple alignments can then be used to evaluate the base constitution at each position and identify the consequent SNPs. Phrap [8] and CAP3 [9] are software tools used for assembling the sequences into contigs.

3.2 De Novo Sequence Data
The de novo SNP mining approach is used where no reference sequence is available. After reading the sequence data, a clustering technique groups the sequence reads that belong to the same region of the genome. Several sequence assembly tools can perform this task, as they split up input datasets that cannot be assembled into a single contig. Specialized tools such as d2_cluster [10] and TGICL [11] have been developed to perform an initial segregation of sequence fragments into homologous groups, which can be further decomposed into clusters of unique origin by assembly tools. Next comes the aligning step, then identifying and classifying sequence variants as potential polymorphisms.

4. RELATED LITERATURE
Nickerson et al. [12] used a tool called PolyPhred for automatically detecting SNPs using a fluorescence-based resequencing method, particularly for heterozygous diploid samples. Its operation is integrated with automated base calling (Phred), sequence assembly (Phrap) and sequence assembly editing (Consed) in a high-throughput system for detecting DNA polymorphisms and mutations. It calculates base-calling quality and identifies sites with high second-base peaks; these sites have low base-quality scores but are indications of a heterozygous locus, and thus an SNP. An improved version of PolyPhred was developed by Stephens et al. [13] to achieve significantly lower false discovery rates.

Marth et al. [14] presented a tool called PolyBayes, which automatically exploits the abundance of Sanger sequencing data for SNP mining. It uses both reference sequence data for mapping and base-calling quality scores as produced by Phred. It calculates the probability that a sequence cluster represents a true allelic sequence variant and the likelihood that a site is a true SNP; the coverage redundancy, base quality values and an a priori estimate of the polymorphism rate are used to calculate the latter. The PolyBayes method was tested on human EST data. PolyBayes has also been used for both 454/Roche and Solexa/Illumina machines; as PolyBayes was developed for Sanger sequencing, adaptations were necessary to optimize the tool for the characteristics of these new sequencing machines, and PyroBayes was developed for the Roche GS-20 machine. PolyFreq is an improved version of PolyBayes aimed at SNP mining in clusters with deep coverage [15]. The main differences between the two methods are that PolyBayes creates a multiple alignment of all clustered reads, while PolyFreq performs only pairwise alignments of new reads with the reference region; in validation, the numbers of false negatives and false positives were lower for PolyFreq than for PolyBayes.

Weckx et al. [16] suggested a tool called novoSNP, an alternative to PolyPhred for resequenced regulatory regions, to improve false detection rates. The algorithm has two major steps, detection and validation, supported by a graphical user interface. The input is a reference sequence combined with a set of trace files, which are aligned by BLAST. Mismatches are identified and rated in a cumulative way by four different scores, using peak heights, differences between peaks, and agreement between forward and reverse reads. The algorithm was applied to a set of over 10,000 trace files originating from two human gene regions, and the performance was compared with that of both PolyPhred and PolyBayes; it outperformed the two alternative methods in both false-negative and false-positive rates. The latest release

includes a database to keep track of the status of sequence variants and annotation.

Zhang et al. [17] developed a tool for sensitive and accurate detection of SNPs called SNPdetector. It is aimed at automating the manual review step in the SNP mining procedure, as an alternative to PolyPhred. It starts from base-calling quality scores (Phred) on specific amplicon fragments; these are aligned using SIM [18], allowing for substantial sequence variation. Next, polymorphisms are identified in regions of high base quality (the neighborhood quality standard). Finally, heterozygous genotypes are identified and SNPs are evaluated by comparing forward and reverse sequencing data. SNPdetector was tested on mouse and human datasets and reduces the false positive and false negative rates compared to PolyPhred and novoSNP.

Manaster et al. [19] introduced a tool called InSNP for automated detection and visualization of SNPs and indels; it is similar to PolyPhred and novoSNP. This software was targeted at mutation detection, and its performance was compared to PolyPhred and Mutation Surveyor; it reduced the false positive and false negative rates compared to PolyPhred.

Dereeper et al. [20] introduced a web-based tool called SNiPlay for detecting SNPs and indels in DNA sequences. From standard sequence alignments, genotyping data or Sanger sequencing traces given as input, SNiPlay detects SNP and indel events and outputs submission files for the design of Illumina's SNP chips. Subsequently, it sends sequences and genotyping data into a series of modules in charge of various processes: physical mapping to a reference genome, annotation, SNP frequency determination in user-defined groups, haplotype reconstruction and networks, linkage disequilibrium evaluation, and diversity analysis. When the input is provided as standard FASTA alignments, SNiPlay uses a home-made Perl module to detect SNP and insertion/deletion events and to extract allelic information for each polymorphic position. A database stores polymorphisms, genotyping data and grapevine sequences released by public and private projects; it allows the user to retrieve SNPs using various filters, to compare SNP patterns between populations, and to export genotyping data or sequences in various formats.

Batley et al. [21] presented a tool called SNPServer, a real-time flexible tool for the discovery of SNPs within DNA sequence data. The program uses BLAST to identify related sequences and CAP3 to cluster and align these sequences. The alignments are parsed to the SNP discovery software autoSNP, a program that detects SNPs and insertion/deletion polymorphisms (indels) in maize expressed sequence tag data; SNP discovery is performed using a redundancy-based approach.

Tang et al. [22] developed a tool called HaploSNPer, a flexible web-based tool for detecting SNPs and alleles in user-specified input sequences from both diploid and polyploid species. It includes BLAST for finding homologous sequences in public EST databases, CAP3 or Phrap for aligning them, and QualitySNP for discovering reliable allelic sequences and SNPs. All possible and reliable alleles are detected by a mathematical algorithm using potential SNP information, and reliable SNPs are then identified based on the reconstructed alleles and on sequence redundancy. It can efficiently detect reliable SNPs, reconstruct haplotypes and therefore identify different alleles using only EST sequence information. Furthermore, HaploSNPer supplies a user-friendly interface for the visualization of SNPs and alleles, which supports the selection of informative SNP and allele-specific markers.

Amigo et al. [23] used a tool called ENGINES (ENtire Genome INterface for Exploring SNVs), which retrieves single nucleotide variation from the 1000 Genomes project. The whole dataset is pre-processed and summarized into a data mart accessible through a web interface. The query system allows the combination and comparison of each available population sample, while searching by rs-number list, chromosome region, or genes of interest. Frequency and FST filters are available to further refine queries, while results can be visually compared with other large-scale Single Nucleotide Polymorphism (SNP) repositories such as HapMap or Perlegen.

5. CONCLUSION
In this paper we presented a detailed study of SNP tools. SNP analyses are mainly used for personalized medication. From this analysis of SNP tools, we find that the ENGINES tool is efficient compared with previous tools such as PolyPhred, HaploSNPer and novoSNP. SNPs are mainly used to find the locations of disease susceptibility genes using the gene mapping technique.

6. REFERENCES
[1] Garg, K., Green, P., and Nickerson, D. A. 1999. Identification of candidate coding region single nucleotide polymorphisms in 165 human genes using assembled expressed sequence tags. Genome Research, 9(11), 1087-1092.
[2] http://learn.genetics.utah.edu/content/pharma/snips/
[3] Salisbury, B. A., Pungliya, M., Choi, J. Y., Jiang, R., Sun, X. J., and Stephens, J. C. 2003. SNP and haplotype variation in the human genome. Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis, 526(1), 53-61.
[4] Van Oeveren, J., and Janssen, A. 2009. Mining SNPs from DNA sequence data; computational approaches to SNP discovery and analysis. In Single Nucleotide Polymorphisms (pp. 73-91). Humana Press.
[5] Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. 1990. Basic local alignment search tool. Journal of Molecular Biology, 215(3), 403-410.
[6] Ning, Z., Cox, A. J., and Mullikin, J. C. 2001. SSAHA: a fast search method for large DNA databases. Genome Research, 11(10), 1725-1729.
[7] Smit, A. F., Hubley, R., and Green, P. 1996. RepeatMasker Open-3.0.
[8] Green, S. J., Monreal, R. P., White, A. T., Bayer, T. G., Arquiza, Y. D., Buenaflor Jr, R., and Arquiza, N. Y. 1999. PHRAP documentation.
[9] Huang, X., and Madan, A. 1999. CAP3: A DNA sequence assembly program. Genome Research, 9(9), 868-877.
[10] Burke, J., Davison, D., and Hide, W. 1999. d2_cluster: a validated method for clustering EST and full-length cDNA sequences. Genome Research, 9(11), 1135-1142.

[11] Pertea, G., Huang, X., Liang, F., Antonescu, V., Sultana, R., Karamycheva, S., and Quackenbush, J. 2003. TIGR Gene Indices clustering tools (TGICL): a software system for fast clustering of large EST datasets. Bioinformatics, 19(5), 651-652.
[12] Nickerson, D. A., Tobe, V. O., and Taylor, S. L. 1997. PolyPhred: automating the detection and genotyping of single nucleotide substitutions using fluorescence-based resequencing. Nucleic Acids Research, 25(14), 2745-2751.
[13] Stephens, M., Sloan, J. S., Robertson, P. D., Scheet, P., and Nickerson, D. A. 2006. Automating sequence-based detection and genotyping of SNPs from diploid samples. Nature Genetics, 38(3), 375-381.
[14] Marth, G. T., Korf, I., Yandell, M. D., Yeh, R. T., Gu, Z., Zakeri, H., and Gish, W. R. 1999. A general approach to single-nucleotide polymorphism discovery. Nature Genetics, 23(4), 452-456.
[15] Wang, J., and Huang, X. 2005. A method for finding single-nucleotide polymorphisms with allele frequencies in sequences of deep coverage. BMC Bioinformatics, 6(1), 220.
[16] Weckx, S., Del-Favero, J., Rademakers, R., Claes, L., Cruts, M., De Jonghe, P., and De Rijk, P. 2005. novoSNP, a novel computational tool for sequence variation discovery. Genome Research, 15(3), 436-442.
[17] Zhang, J., Wheeler, D. A., Yakub, I., Wei, S., Sood, R., Rowe, W., and Buetow, K. H. 2005. SNPdetector: a software tool for sensitive and accurate SNP detection. PLoS Computational Biology, 1(5), e53.
[18] Huang, X., Hardison, R. C., and Miller, W. 1990. A space-efficient algorithm for local similarities. Computer Applications in the Biosciences: CABIOS, 6(4), 373-381.
[19] Manaster, C., Zheng, W., Teuber, M., Wächter, S., Döring, F., Schreiber, S., and Hampe, J. 2005. InSNP: a tool for automated detection and visualization of SNPs and InDels. Human Mutation, 26(1), 11-19.
[20] Dereeper, A., Nicolas, S., Le Cunff, L., Bacilieri, R., Doligez, A., Peros, J. P., and This, P. 2011. SNiPlay: a web-based tool for detection, management and analysis of SNPs. Application to grapevine diversity projects. BMC Bioinformatics, 12(1), 134.
[21] Savage, D., Batley, J., Erwin, T., Logan, E., Love, C. G., Lim, G. A., and Edwards, D. 2005. SNPServer: a real-time SNP discovery tool. Nucleic Acids Research, 33(suppl 2), W493-W495.
[22] Tang, J., Leunissen, J. A., Voorrips, R. E., van der Linden, C. G., and Vosman, B. 2008. HaploSNPer: a web-based allele and SNP detection tool. BMC Genetics, 9(1), 23.
[23] Amigo, J., Salas, A., and Phillips, C. 2011. ENGINES: exploring single nucleotide variation in entire human genomes. BMC Bioinformatics, 12(1), 105.
Preprocessing of Tamil Handwritten Character Image using Various Filters

S. Ponmani*, Ph.D Research Scholar, Department of Computer Science, Erode Arts and Science College (Autonomous), Erode, Tamil Nadu, India

Dr. S. Pannirselvam, Head & Associate Professor, Department of Computer Science, Erode Arts and Science College (Autonomous), Erode, Tamil Nadu, India
ABSTRACT
Today, image processing penetrates into various fields, but it still struggles with image quality issues. Handwriting has continued to persist as a means of communication and recording information in day-to-day life, even with the introduction of new technologies. Handwriting is a skill that is personal to individuals. Recognition of characters is an important area in machine learning. Moreover, in numerous situations, a pen together with paper or a small notepad is much more convenient than a keyboard. Various methods have been presented for handwritten character images, but they have not achieved a well-preprocessed image. It is necessary to perform several document analysis operations prior to recognizing text in scanned documents. This paper presents a detailed analysis of various preprocessing operations performed prior to recognition of Tamil handwritten characters. The method includes noise removal, image enhancement using various morphological techniques, and skew correction. The experimental results show that this method provides better enhancement in terms of quality when compared with existing methods such as the Median, Gaussian and Wiener filters. The Peak Signal to Noise Ratio (PSNR) and Mean Square Error (MSE) are used as similarity measures.

General Terms
Pattern Recognition, Document Image Recognition, Optical Character Recognition (OCR).

Keywords
Noise, Filters, PSNR, MSE

1. INTRODUCTION
Tamil, which is a south Indian language, is one of the oldest languages in the world. It has been influenced by Sanskrit to a certain degree [2], but Tamil is unrelated to the descendants of Sanskrit such as Hindi, Bengali and Gujarati. Most Tamil letters have circular shapes, partially due to the fact that they were originally carved with needles on palm leaves, a technology that favored round shapes. The Tamil script is used to write the Tamil language in Tamil Nadu, Sri Lanka, Singapore and parts of Malaysia, as well as to write minority languages such as Badaga. The Tamil alphabet consists of 12 vowels, 18 consonants and one special character (the aytam). Vowels and consonants are combined to form composite letters, making a total of 247 different characters, plus some Sanskrit characters. The complete Tamil alphabet and composite character formations are given in [5]. With a separate symbol for each vowel in composite character formations, there is a possibility to reduce the number of symbols used by the alphabet. Noise is characterized by its pattern and its probabilistic characteristics. There is a wide variety of noise types, such as Gaussian noise, salt and pepper noise, Poisson noise, impulse noise and speckle noise.

2. CHARACTERISTICS OF TAMIL SCRIPTS
Tamil, one of the ancient languages of India, is a native language of south India. Tamil letters have circular shapes as they were originally carved with needles on palm leaves [1]. The Tamil script has 12 vowels (uyireluttu), 18 consonants (meyyeluttu) and one character, the aytam, which is classified in Tamil grammar as being neither a consonant nor a vowel, though often considered part of the vowel set. The script, however, is syllabic and not alphabetic. The complete script, therefore, consists of the thirty-one letters in their independent form and an additional 216 combinant letters, representing a total of 247 combinations (uyirmeyyeluttu) of a consonant and a vowel, a mute consonant, or a vowel alone. These combinant letters are formed by adding a vowel marker to the consonant. Some vowels require the basic shape of the consonant to be altered in a way that is specific to that vowel. Others are written by adding a vowel-specific suffix to the consonant, yet others a prefix, and finally some vowels require adding both a prefix and a suffix to the consonant.

3. RELATED WORK
Humans still outperform machines in the area of handwritten character recognition. Many researchers have contributed work in the areas of both online and offline handwriting recognition in different language scripts by proposing various techniques and models. This section discusses previous work on preprocessing and approaches to collecting datasets in the area of offline handwritten character recognition.

Baheti M. J. et al. [6] reported that no standardized dataset of handwritten images for the Gujarati script is available, and hence proposed a sample datasheet; they collected handwritten Gujarati numerals from 80 writers belonging to various diversities, applied some preprocessing algorithms, and employed k-nearest neighbor and principal component analysis classifiers for Gujarati numeral recognition.

For handwritten Gujarati numeral recognition, Apurva A. Desai [7] collected Gujarati numerals from 300 writers and applied preprocessing techniques to bring the images into a standard form; further, how the quality of paper influences writing and the preprocessing required is discussed. The preprocessing tasks involved are adjustment of contrast, smoothing, and resizing the image to a standard form, i.e. 16x16 pixels. Classification is performed using the nearest neighborhood method.
For recognizing Kannada, Telugu and Devanagari handwritten numerals, B. V. Dhandra et al. [8] proposed a novel approach where noise is removed by a median filter, and morphological operations are performed to remove scanning artifacts.

Kamal Moro et al. [9] reported that there is no standard database available for Gujarati and hence developed a database by collecting handwritten characters from a large number of writers, scanned at 300 dpi; the images were binarized and skeletonized. For feature extraction, horizontal, vertical and two diagonal profiles were used, and a neural network was employed as the classifier in the task of recognizing Gujarati handwritten numeral optical characters.

In order to overcome the issues in the existing methods, the directional filtering approach using the Bayer pattern is used to improve the quality of the image.

4. EXISTING METHODOLOGY

4.1 Filters
Generally, filters are used to filter out unwanted objects in the spatial domain or on the image surface. In digital image processing, images are mostly affected by various noises. The main objective of applying filters is to improve the quality of the image by enhancing the interpretability of the information present in the image for human vision. A general structure of a filter mask is as follows.

-1 -1 -1
-1  N -1
-1 -1 -1

Fig. 1 Filtering Mask

Image filtering can be used for many purposes, including smoothing, sharpening, noise elimination and edge detection. A filter is defined by a kernel, which is represented as a small array applied to each pixel and its neighbours within an image.

4.2 Gaussian Filter
Gaussian filters are linear smoothing filters whose weights are selected based on the Gaussian distribution function. Mainly, these filters are used to smooth the image and to eliminate the Gaussian noise present in the image. This is formulated as follows:

h(m, n) = [1/√(2πσ²) e^(-m²/2σ²)] × [1/√(2πσ²) e^(-n²/2σ²)] ...(1)

From equation (1), the Gaussian filter is separable. The Gaussian smoothing filter is very good at removing normally distributed noise. This filter is rotationally symmetric, so the amount of smoothing is the same in all directions. The degree of smoothing is governed by the variance σ².

4.3 Mean Filter (MF)
The Mean Filter is a linear filter which uses a mask over each pixel in the signal. Each of the pixel components which fall under the mask are averaged together to form a single pixel. This filter is also called the average filter. The Mean Filter is poor at edge preserving. The Mean filter is defined by:

f(x, y) = (1/mn) Σ g(s, t), (s, t) ∈ Sxy ...(2)

where Sxy is the set of pixel coordinates in the m x n mask centred at (x, y). Mean filtering is a simple, intuitive and easy to implement method of smoothing images, i.e. reducing the amount of intensity variation between one pixel and the next. It is often used to reduce noise in images.

The idea of mean filtering is simply to replace each pixel value in an image with the mean value of its neighbours, including itself. This has the effect of eliminating pixel values which are unrepresentative of their surroundings. Mean filtering is usually thought of as a convolution filter. Like other convolutions it is based around a kernel, which represents the shape and size of the neighbourhood to be sampled when calculating the mean.

4.4 Wiener Filter
The Wiener filtering method requires information about the spectra of the noise and the original signal, and it works well only if the underlying signal is smooth. The Wiener method implements spatial smoothing, and its model complexity control corresponds to choosing the window size [9].

W(u, v) = H*(u, v) / (|H(u, v)|² + Pn(u, v) / Ps(u, v)) ...(3)

Where
H(u, v) = Degradation function
H*(u, v) = Complex conjugate of the degradation function
Pn(u, v) = Power spectral density of the noise
Ps(u, v) = Power spectral density of the un-degraded image

Wiener filtering is able to achieve significant noise removal when the variance of the noise is low; otherwise it causes blurring and smoothing of the sharp edges of the image. For the detection of emotions in a highly corrupted noisy environment, this approach involves the removal of noise from the image by the Wiener filter; an automatic system for the recognition of facial expressions is based on such a representation of the expression [10].
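For reference, the three baseline filters of this section can be applied in a few lines of Python; the sketch below assumes SciPy is available and that the input is a grayscale image held in a float NumPy array (the parameter values are illustrative):

import numpy as np
from scipy import ndimage, signal

def apply_existing_filters(img):
    # Gaussian smoothing, cf. equation (1); sigma governs the smoothing degree
    gaussian = ndimage.gaussian_filter(img, sigma=1.0)
    # Mean (average) filter, cf. equation (2); a 3x3 averaging mask
    mean = ndimage.uniform_filter(img, size=3)
    # Adaptive local Wiener filter, cf. equation (3)
    wiener = signal.wiener(img, mysize=3)
    return gaussian, mean, wiener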
5. METHODOLOGY
Considering the inefficiency of the existing image enhancement methods, there is a need to propose a new methodology for handwritten character image enhancement that improves image quality. Here we propose a novel filtering technique for handwritten character images.

5.1 Median Filter (MF)
The Median filter, the most prominently used impulse noise removing filter, provides better removal of impulse noise from corrupted images by replacing the individual pixels of the image, as the name suggests, with the median value of the gray levels. The median of a set of values is such that half of the values in the set are below the median and half of them are above it, and so it is the most acceptable value, compared with any other image statistic, for replacing an impulse-corrupted pixel of a noisy image: if there is an impulse in the set chosen to determine the median, it will lie strictly at the ends of the sorted set, and the chance of identifying an impulse as the median to replace the image pixel is very small. The best known order-statistics filter is the median filter, which replaces the value of a pixel by the median of the gray levels in the neighborhood of that pixel.

The median filter is an effective method that can suppress isolated noise without blurring sharp edges. In median filtering, all the pixel values are first sorted into numerical order, and the pixel is then replaced with the middle pixel value [8]. Let y represent a pixel location and w represent a neighborhood centered around location (m, n) in the image; then the working of the median filter is given by

y[m, n] = median{ x[i, j] : (i, j) ∈ w } ...(4)

Here the pixel y[m, n] represents the location of the pixel y, where m and n are its x and y co-ordinates. w represents the neighborhood of pixels surrounding the position (m, n), and (i, j) ranges over that neighborhood centered around (m, n). Thus the median method takes the median of all the pixels within the range of (i, j), represented by x[i, j].

The original value of the pixel is included in the computation of the median. Median filters are quite popular because, for certain types of random noise, they provide excellent noise reduction capabilities with considerably less blurring than linear smoothing filters of similar size. Fig. 1.2 illustrates an example of how the median filter is calculated.

Fig. 1.2 Median Filter
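A minimal Python sketch of this filtering step, assuming NumPy and a 3x3 window that mirrors the neighbourhood w of equation (4):

import numpy as np

def median_filter(img, k=3):
    # Replace each pixel with the median of its k x k neighbourhood,
    # including the pixel itself, as in equation (4).
    pad = k // 2
    padded = np.pad(img, pad, mode='edge')
    out = np.empty_like(img)
    for m in range(img.shape[0]):
        for n in range(img.shape[1]):
            out[m, n] = np.median(padded[m:m + k, n:n + k])
    return out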
5.2 Thresholding
The task of thresholding is to extract the foreground from the background. A number of thresholding techniques have been previously proposed, using global and local approaches. Global methods apply one threshold to the entire image, while local thresholding methods apply different threshold values to different regions of the image [Leedham]. The histogram of gray-scale values of a document image typically consists of two peaks: a high peak corresponding to the white background and a smaller peak corresponding to the foreground. So the task of determining the threshold gray-scale value is one of determining an optimal value in the valley between the two peaks. Here Otsu's method of histogram-based global thresholding is used.

5.3 Skew Detection
There are several commonly used methods for detecting skew in a page; some rely on detecting connected components (for many purposes, they are roughly equivalent to characters) and finding the average angles connecting their centroids. The method we employed (after observing it in Fateman's program) was to project the page at several angles and determine the variance in the number of black pixels per projected line. The projection parallel to the true alignment of the lines will likely have the maximum variance, since, when parallel, each given ray projected through the image will hit either almost no black pixels (as it passes between text lines) or many black pixels (while passing through many characters in sequence).

Skew Correction
After the skew angle of the page has been detected, our recognition algorithm demands that the page be rotated to correct for this skew. Our rotation algorithm had to be both fairly fast and fairly accurate. It was a pure coordinate transformation, which takes a little time on large images, but gets the rotation exact.

5.4 Algorithm
Input: Original image
Output: Preprocessed image
Algorithm:
Step 1: Start.
Step 2: Select an input image from the DB.
Step 3: Preprocess the image as follows: apply the median filter to the input image; calculate the threshold value; apply the threshold value to the image; estimate MSE and PSNR.
Step 4: Correct the skew lines in the image.
Step 5: Repeat steps 2 to 4 for all images in the DB.
Step 6: Stop.
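The projection-profile skew detection described above translates into a short Python sketch (assuming SciPy and a binary page image in which text pixels are 1; the candidate angle range is illustrative):

import numpy as np
from scipy import ndimage

def detect_skew(binary_img, angles=np.arange(-10, 10.5, 0.5)):
    # Rotate the page through candidate angles and keep the angle whose
    # row-wise black-pixel counts (the projection profile) have the
    # largest variance: rays then run parallel to the text lines.
    best_angle, best_var = 0.0, -1.0
    for a in angles:
        rotated = ndimage.rotate(binary_img, a, reshape=False, order=0)
        variance = rotated.sum(axis=1).var()
        if variance > best_var:
            best_angle, best_var = a, variance
    return best_angle

# Skew correction is then a single rotation by the detected angle:
# deskewed = ndimage.rotate(binary_img, detect_skew(binary_img), reshape=False)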
6. SIMILARITY MEASURES

6.1 Mean Squared Error (MSE)
The mean squared error is calculated by

MSE = (1/MN) Σ(i=1..M) Σ(j=1..N) [g(i, j) - f(i, j)]²

where M and N are the total numbers of pixels in the horizontal and vertical dimensions of the image, g denotes the noisy image and f denotes the filtered image.

6.2 Peak Signal to Noise Ratio (PSNR)
The peak signal to noise ratio is calculated by

PSNR = 10 log10 (255² / MSE)

For the image quality measures, the higher the PSNR value for an image of a particular noise type, the better the quality of the image.
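Both measures translate directly into code; a small Python sketch for 8-bit grayscale images held in NumPy arrays:

import numpy as np

def mse(g, f):
    # Mean squared error between the noisy image g and the filtered image f
    return np.mean((g.astype(np.float64) - f.astype(np.float64)) ** 2)

def psnr(g, f):
    # Peak signal-to-noise ratio in dB, with peak value 255 for 8-bit images
    e = mse(g, f)
    return float('inf') if e == 0 else 10.0 * np.log10(255.0 ** 2 / e)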
7. EXPERIMENTATION AND RESULTS
The proposed model is experimented with different images in the database. The experiments were carried out in Matlab R2014a. The morphological operations were performed using Matlab's Image Processing Toolbox. The dataset consisted of several samples, out of which randomly selected samples were used for testing.

Fig 3: Original Image

Fig 4: Noise Removed Image

Fig 5: Skew Corrected Image

The experimental results of the proposed method and the existing methods, with the computed PSNR values, are presented in the following table.

Table 1: PSNR values of the existing and proposed methods

Image No.   Gaussian   Mean    Wiener   Proposed
1           30.25      29.47   30.77    53.52
2           31.05      30.58   29.63    56.62
3           30.15      29.62   27.95    55.65
4           31.10      30.86   30.99    56.68

Table 1 shows the experimental values obtained from the different preprocessing methods for the selected character images from the database. The performance was evaluated using the Mean Square Error (MSE) and Peak Signal Noise Ratio (PSNR) in order to assess the quality of the image. From the analysis of the values in the table, the proposed method is better, with lower MSE and higher PSNR values.

Fig. 6 shows the pictorial representation of the performance evaluation. By analysing the obtained results, the proposed model produced the best results. Hence the proposed method is an efficient one.

[Chart: PSNR (0-60) of the Gaussian, Mean, Wiener and Proposed methods for images 1-4] Fig 6. Performance Evaluation

8. CONCLUSION
This paper proposes a new enhancement method for handwritten character images. The experimental results prove the effectiveness of this approach, providing good PSNR values when compared to existing methods. The PSNR performance of the proposed method is better than that of the existing filters, and the proposed method produces better accuracy results compared with the existing methods. Moreover, the computational cost of the algorithm is very low. Therefore, the proposed algorithm is a candidate for implementation in simple low-cost cameras or in video capture devices with high values of resolution and frame rate. The proposed scheme is capable of achieving at least comparable and often better performance than existing techniques.

9. REFERENCES
[1]. N. Dhamayanthi and P. Thangavel, "Handwritten Tamil character recognition using neural network," Proceedings of Tamil Internet 2000, Singapore, July 22-24, 2000, pp. 171-176.
[2]. S. Hewavitharana and H. C. Fernando, "A two stage classification approach to Tamil handwriting recognition," Tamil Internet 2002, California, USA, pp. 118-124.
[3]. Srihari, "Online and offline handwriting recognition: A comprehensive survey," IEEE PAMI, Vol. 22, No. 1, Jan. 2000.
[4]. P. Chinnuswamy and S. G. Krishnamoorthy, "Recognition of hand printed Tamil characters," Pattern Recognition, Vol. 12, pp. 141-152, 1980.
[5]. N. Otsu, "A threshold selection method from grey level histograms," IEEE Trans. Syst. Man Cybern., vol. 9, no. 1, 1979, pp. 62-66.
[6]. Baheti M. J. et al., "Comparison of classifiers for Gujarati numeral recognition," International Journal of Machine Intelligence, vol. 3, pp. 160-163, 2011.
[7]. A. A. Desai, "Gujarati handwritten numeral optical character recognition through neural network," Pattern Recognition, vol. 43, 2010.
[8]. B. V. Dhandra et al., "Telugu and Devanagari handwritten numeral recognition with probabilistic neural network: A novel approach," pp. 83-88, 2010.
[9]. Kamal Moro et al., "Gujarati handwritten numeral optical character through neural network and skeletonization," Jurnal of Sistem Komputer, vol. 3, no. 1, pp. 40-43, 2013.
[10]. K. D. N. Shanthi, "A novel SVM-based handwritten Tamil character recognition system," pp. 173-180.
[11]. J. Prasad, U. Kulkarni and R. Prasad, "Template matching algorithm for Gujarati character," in Proc. 2nd International Conference on Emerging Trends in Engineering and Technology (ICETET), 2009.
[12]. J. Prasad, U. Kulkarni and R. Prasad, "Offline handwritten character recognition of Gujarati script using pattern matching," in Proc. 3rd International Conference on Anti-counterfeiting,
Protection of Private Data over the Transmission through Electronic Gadgets using RSA Algorithm

R. Sridevi, Assistant Professor, Department of Computer Science, PSG College of Arts & Science, Coimbatore, Tamil Nadu. srinashok@gmail.com

R. Rahinipriyadharshini, Research Scholar, Department of Computer Science, PSG College of Arts & Science, Coimbatore, Tamil Nadu. rahinipriyadharshini@gmail.com
ABSTRACT
In today's world, security is required to transmit confidential information over the network. The security of information and communication becomes more and more important with the rapid development of information technology. Cryptographic algorithms play a vital role in providing data security against malicious attacks. The main goal of cryptography is not only to protect data from being attacked or stolen; it can also be used for the authentication of users. There are two types of cryptographic algorithms, namely symmetric key cryptographic algorithms and asymmetric key cryptographic algorithms. A symmetric key cryptographic algorithm uses only one key for both encryption and decryption. An asymmetric key algorithm uses two different keys for encryption and decryption of the message. The public key is made publicly available and can be used to encrypt messages. The private key is kept secret and can be used to decrypt received messages. This paper is about how data are protected while being transferred from one electronic device to another using the RSA algorithm. Nowadays, many electronic devices such as phones, tablets and personal computers are used in the workplace for transferring data. The RSA algorithm provides privacy, integrity and confidentiality. It is mandatory to secure that information to prevent unauthorized access to it by any node in the path.

Keywords
Cryptography, encryption, decryption, asymmetric cryptographic algorithms, RSA algorithm.

1. INTRODUCTION
Computer data transfer plays a very important role in daily life. Security for a system is essential nowadays: with the growth of information technology and the emergence of new techniques, the number of threats a user has to deal with has grown exponentially. The purpose of data transfer is to focus on finding answers to real-life problems by transferring information from one person to another. The importance of transferred data can range across businesses, schools, companies, and government documents. In order to analyze how data is transferred from electronic devices and to prevent private data from getting out, we need to simulate several different ways in which data can be transferred on electronic devices. Data transfer is achieved by safely copying or moving important data from one location to another. Some examples are computer to computer, computer to electronic device, electronic device to electronic device, electronic device to server, and computer to server. It is much easier and faster to transfer data today than it was in the past few decades. Nevertheless, it is even easier for hackers to get private data from users. As a result, much research is needed to find safer ways to transfer data and information on electronic devices.

2. RELIABLE DATA TRANSMISSION AND USAGE IN ELECTRONIC GADGETS
Understanding where you hold private data internally is essential, but it does not provide a complete view of where your data resides. Understanding data flows is a complicated task, and technology can help to provide transparency into network-based data transfers. However, you must additionally understand third-party data access, business-driven data exchange and end-user data transmission capabilities.

2.1 Third party data access
Data identification should include obtaining an understanding of private data that is accessible by third parties. This includes data that is exchanged with third parties and third parties that have direct access to internal systems. Once an inventory of third parties with access to private data exists, controls should be implemented to safeguard third-party access. Focus areas should include:

- Secure data transmissions.
- Controlled access to data.
- Monitoring of third-party access to private data.
- Third-party due diligence and information security assurance.

2.2 Design the policies and Standards
Once private data is classified and identified, data protection policies should be developed and/or customized to document security requirements that are specific to the type of private data held by the organization. A high-level policy specifying the requirements for protecting private data should exist and clearly link to the data classification policy.

- Transmission of private data through email and the internet.
- Storage of private data on electronic devices: laptops, tablets, PCs, CDs, workstations.
- Appropriate use of remote access technologies.
- User responsibilities for classifying data at the point of creation and ensuring that private data users create is included in relevant data/information inventories.
- Private data may not be transmitted through public networks without adequate encryption.
- Approved technologies may be used to exchange data with third parties.
- Private data may not be shared with third parties without sufficient contracts in place specifying information security requirements, their obligations to protect company data, their responsibilities for monitoring their own third parties, and the company's right to audit and monitor.
- Access to private data must be logged and monitored where appropriate.
- Private data must be anonymized before being stored in less controlled environments, such as test and development environments.
- Private data must be adequately protected through all stages of the data lifecycle and the systems development lifecycle.

2.3 How to protect the electronic gadgets?
- Harden electronic gadget configurations.
- Enable features such as password protection and remote wipe facilities.

3. TECHNICAL DATA SECURITY THREATS TO INFORMATION SYSTEMS

3.1 Mobile devices
The use of electronic devices, such as laptops or handheld devices including smart phones, is exploding; however, the security of electronic devices is lagging behind. The situation is complicated by the fact that these devices are often used to conduct work outside the organization. Breaches can occur in a number of ways: devices can be lost, stolen, or their security can be compromised by malicious code invading the operating system and applications. Mitigation: To promote data security in case a mobile device is lost or stolen, encryption should be applied before storing private data or information on the mobile device. Until more data encryption, user authentication and anti-malware solutions become available for mobile devices, the best protection is policy, together with monitoring the network for malicious activity.

3.2 Removable media
The use of removable media on an organization's network poses a significant security threat. Without proper protection, these types of media provide a pathway for malware to move between networks or hosts. Proper security measures when using removable media devices are necessary to decrease the risk of infecting the organization's machines or the entire network. Mitigation: To minimize the security risks, apply simple preventative steps. These include disabling the auto-run feature of the operating system on the organization's machines and training users to scan removable media for viruses before opening the files.

3.3 Poor Passwords
This is especially important for users with access to the most private information. Modern password-cracking programs can easily break weak passwords, such as those containing common words or word groups found in a dictionary. For this reason, user-selected passwords are generally considered to be weaker than randomly generated passwords.

3.4 Viruses and Worms
Terminology such as "malware" (or malicious software) is emerging. Instant Messaging (IM) is very popular because it provides flexibility, speed and ease of communication, but it is also very vulnerable to attacks because of its flexibility. Attacks are not limited to personal computers (PCs); they now include cell phones and other processor-based electronics, and will only increase and become more sophisticated. Protecting the security system database from unwanted electronic intruders requires that no software be introduced into the security network without the security management's approval.

3.5 Backup
Backups are kept for several reasons: a computer crash, documentation, employee/contractor investigation, and file corruption. For these reasons, there must be several levels of backup available. The requirements will vary depending upon the specific application.

3.6 Physical Protection
The purpose of physical protection is to prevent access and to detect unauthorized surreptitious access. Protection can be in the form of conduit, sealed cable trays, locked rooms, and alarms that indicate potential tampering or unauthorized access.

4. CRYPTOGRAPHY
Cryptography is the art and science of achieving security by encoding a simple message to make it unreadable. The message to send is in simple or ordinary language understood by all; it is called plaintext. The process of converting plaintext into a form which cannot be understood without having special information is called encryption. This unreadable form is called ciphertext, and the special information is called the encryption key. The conversion of ciphertext back into plaintext with special knowledge is called decryption, and the special knowledge required for decryption is called the decryption key. Only the receiver has this special knowledge, and only the receiver can decrypt a ciphertext with it.

There are basically two types of cryptography-based techniques for converting plaintext to ciphertext and vice versa, namely symmetric and asymmetric cryptography. In symmetric cryptography, the sender and receiver use the same key for encryption and decryption of text, whereas in asymmetric cryptography two keys, namely public and private keys, are used for the encryption and decryption process. By keeping the private key safe, you can ensure that the data remain safe. But the disadvantage of asymmetric algorithms is that they are computationally intensive. Therefore, in this proposed system, symmetric key cryptography is used with the intention of less computation but high data security. Cryptography mechanisms depend on the degree of randomness and uncertainty in the generation of the ciphertext from the plaintext; hence, depending on phenomena of nature, there are various types of cryptography. Modern cryptography is based on difficult mathematical problems such as prime factorization and matrix manipulation.
4.1 RSA Algorithm
RSA is a commonly adopted public key cryptography algorithm. The first, and still most commonly used, asymmetric algorithm, RSA is named for the three mathematicians who developed it: Rivest, Shamir, and Adleman. RSA today is used in hundreds of software products and can be used for key exchange, digital signatures, or encryption of small blocks of data. RSA uses a variable-size encryption block and a variable-size key. The key pair is derived from a very large number, n, that is the product of two prime numbers chosen according to special rules. Since it was introduced in 1977, RSA has been widely used for establishing secure communication channels and for authenticating the identity of service providers over insecure communication media. In the authentication scheme, the server implements public key authentication with the client by signing a unique message from the client with its private key, thus creating what is called a digital signature. The signature is then returned to the client, which verifies it using the server's known public key. Encryption algorithms consume a significant amount of computing resources such as CPU time, memory, and battery power. This paper examines a method for evaluating the performance of various algorithms. A performance characteristic typically depends on both the encryption key and the input data. A comparative analysis is performed for those encryption algorithms at different sizes of data blocks and, finally, encryption/decryption speed. Features of the RSA algorithm:

- Secrecy and privacy.
- Integrity and non-repudiation.
- Authentication.

Fig 1: Encryption and decryption using secure RSA

4.2 Private Key cryptography
In cryptography, a private or secret key is used for encryption and decryption. The key is known only to the party or parties that exchange secret messages. In traditional secret key cryptography, a key would be shared by the communicators so that each could encrypt and decrypt messages. The private key, used for decryption, is also known as a secret key. The main risk in this system is that if either party loses the key or it is stolen, the system is broken. So here comes the public key: a more recent alternative is to use a combination of public and private keys. In this system, a public key is used together with a private key.

4.3 Public key cryptography
Public-key cryptography, also known as asymmetric cryptography, is a class of cryptographic algorithms which requires two separate keys, one of which is secret (or private) and one of which is public. The public key is used to encrypt plaintext or to verify a digital signature. The public and private key pair comprises two uniquely related cryptographic keys (basically long random numbers), for example: 3048 0241 00C9 18FA CF8D EB2D EFD5 FD37 89B9 E069 EA97 FC20 5E35 F577 EE31 C4FB C6E4 4811 7D86 BC8F BAFA 362F 922B F01B 2F40 C744 2654 C0DD 2881 D673 CA2B 4003 C266 E2CD CB02 0301 0001. Because the key pair is mathematically related, whatever is encrypted with a public key may only be decrypted by its corresponding private key, and vice versa. Public-key cryptography finds application in, among others, the IT security discipline of information security. Public-key cryptography is used as a method of assuring the confidentiality, authenticity and non-repudiability of electronic communications and data storage, whereas the private key is used to decrypt ciphertext or to create a digital signature.

4.4 Encryption Algorithm
The term is most often associated with scrambling plaintext into ciphertext; this process is called encryption. Encryption is the conversion of electronic data into another form, called ciphertext, which cannot be easily understood by anyone except authorized parties. The primary purpose of encryption is to protect the confidentiality of digital data stored on computer systems or transmitted via the Internet or other computer networks. Modern encryption algorithms play a vital role in the security assurance of IT systems and communications, as they can provide not only confidentiality but also other key elements of security.

4.5 Decryption Algorithm
Decryption is the process of converting a ciphertext message into a plaintext message using the key. The RSA cryptosystem is the most attractive and popular security technique for many applications, such as electronic commerce and secure Internet access. The decryption operation has to take more computational cost, since it performs a modular exponentiation. The algorithm for the encryption and decryption process is represented below.

Fig 2: Sample decryption algorithm
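As a concrete illustration of the encryption and decryption steps of Sections 4.4 and 4.5, the following is a minimal textbook-style RSA sketch in Python; the toy primes are for readability only, and real systems must use vetted cryptographic libraries, large primes and proper padding:

# Key generation (toy numbers; pow(e, -1, phi) needs Python 3.8+)
p, q = 61, 53                  # two primes chosen for the example
n = p * q                      # modulus n = 3233, part of both keys
phi = (p - 1) * (q - 1)        # Euler's totient of n
e = 17                         # public exponent, coprime with phi
d = pow(e, -1, phi)            # private exponent: modular inverse of e

def encrypt(m):
    return pow(m, e, n)        # ciphertext c = m^e mod n

def decrypt(c):
    return pow(c, d, n)        # plaintext  m = c^d mod n

message = 65
assert decrypt(encrypt(message)) == message   # round trip succeeds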
5. FUTURE SCOPE
Some future problems will arise in the business world concerning how to protect private or organizational data from attackers, since most private data is targeted by hackers trying to obtain personal information. Even so, it is difficult to tell how private data will be treated in the future and how it will change electronic gadgets. Future work will involve more consciousness in protecting private data on electronic gadgets by implementing better software, methods and efficient algorithms for the secure transmission of data through electronic devices. We will focus on asymmetric cryptographic algorithms, such as RSA and elliptic curve algorithms, for encrypting and decrypting data at high speed compared to the existing system, and on implementing them in electronic gadgets for protecting private data.

6. CONCLUSION
Technological innovations have put an end to traditional working processes. Today's business world mainly depends upon electronic gadgets for its working processes, such as online business, e-commerce, and file transfer through electronic gadgets over the network. These new technologies are expected to face some serious threats to information security. So, keeping this in mind, the necessary planning activities should be carried out to safeguard information at all levels within a corporation or organization and to secure the data. It is also important to standardize working processes; to determine which procedures are in compliance with information security standards; and to set safety policies for the usage of electronic devices and wireless networks. The proposed algorithm reduces the effectiveness of intrusion attacks, as only a part of the message will be available even if the intruder intercepts any message and decrypts it; moreover, decrypting part of a message is not very easy. It uses the RSA algorithm, one of the most effective and commonly used cryptographic algorithms, and adds more steps to it to reduce attacks. A crucial part of information security is the privacy issue; the standards for this issue should also be defined within an organization.
Review of Techniques Compromising Search Engines: Web Spam

Dr. S. K. Jayanthi, Head and Associate Professor, Department of Computer Science, Vellalar College for Women, Erode, India. jayanthiskp@gmail.com

V. Subhashini, Research Scholar, Department of Computer Science, Vellalar College for Women, Erode, India. vsubhamohan@gmail.com
ABSTRACT
Search engines are an excellent medium for sharing information. Web spam reduces the quality of search results and increases the cost of each processed query due to the storage and retrieval of useless pages. Spammers encourage viewers to visit their sites by providing undeserved advertisements and malicious content in their pages, and try to install malware on the visitor's machine. Recently, web spam has increased rapidly, leading to a degradation of search results. This paper presents a review of the web spam techniques compromising search engines.

Keywords
Web spam, Content spam, Link spam, Content hiding, Cloaking, Redirection, PageRank, HITS.

1. INTRODUCTION
The World Wide Web provides a vast amount of data, and more and more people rely on the wealth of information available in it. The increasing importance of search engines has given rise to web spam, which exists to mislead search engines so that spam pages rank high in search results and thus capture user attention. There are different goals for uploading a spam page: the first is to attract viewers to the spammers' sites and enhance the score of the page in order to increase financial benefits for the site owners; the second is to encourage people to visit their sites in order to introduce their company and its products, and to persuade visitors to buy those products; the third is to install malware on victims' computers. Web spam can be classified as content spam (adding irrelevant words to the document to rank it high) and link spam (spam on hyperlinks) [1, 2].

2. SPAMMING TECHNIQUES
Content spamming and link spamming are the two spamming techniques that influence the ranking algorithms used by search engines [5].

2.1 Content spamming
Web spam manipulates the content of web pages by stuffing them with keywords repeated several times [5]. A large number of spam pages are machine generated, such as those shown in Figures 1 and 2.

Figure 1: Top hit list of a major search engine for the query "download freemp3 digital camera Microsoft Linux".

Search engines consider the occurrence of the query in a web page. Each type of location is called a field. The text fields for a page p are the document body, the title, the Meta tags in the HTML header, and page p's URL. The anchor texts associated with URLs that point to p are also considered as belonging to page p (the anchor text field), as they describe the content of p. The terms in p's text fields are used to determine the relevance of p with respect to a specific query (a group of query terms), with different weights given to different fields.

The algorithms use various forms of the fundamental TFIDF metric to rank web pages [1]. Given a specific text field, for each term t that is common to the text field and a query, TF(t) is the frequency of that term in the text field. For instance, if the term "spam" appears 6 times in a document body that is made up of a total of 20 terms, TF(spam) is 6/20 = 0.3.
Figure 2: A page with no content other than Google ads.

The inverse document frequency IDF(t) of a term t is related to the number of documents in the collection that contain t. For instance, if "spam" appears in 4 out of the 40 documents in the collection, its IDF(spam) score will be 10. The TFIDF score of a page p with respect to a query q is then computed over all common terms t:

TFIDF(p, q) = Σ TF(t) · IDF(t), summed over all terms t common to p and q

Based on TFIDF scores, spammers can have two goals: either to make a page relevant for a large number of queries (i.e., to receive a non-zero TFIDF score), or to make a page very relevant for a specific query (i.e., to receive a high TFIDF score). The first goal can be pursued by including a large number of distinct terms in a document. The second goal can be pursued by repeating some targeted terms.
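A small Python sketch of this scoring scheme, using the plain N/df form of IDF from the example above rather than the more common logarithmic variant:

def tfidf(page_terms, query_terms, docs):
    # Score one text field against a query: sum TF(t) * IDF(t) over
    # the terms t shared by the field and the query.
    score = 0.0
    for t in set(query_terms) & set(page_terms):
        tf = page_terms.count(t) / len(page_terms)
        df = sum(1 for d in docs if t in d)        # document frequency
        score += tf * (len(docs) / df)             # IDF = N / df
    return score

docs = [{"spam"}, {"ham"}, {"spam", "eggs"}, {"eggs"}]
page = ["spam", "and", "eggs", "spam"]
print(tfidf(page, ["spam"], docs))   # TF = 2/4, IDF = 4/2, score = 1.0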

2.1.1 Title spam
Title stuffing makes search engines raise the keyword content; hence spammers place a large number of spam terms in the document title.

2.1.2 Body spam
Body spam achieves a high ranking of a page for a limited set of queries by using the repetition strategy of overstuffing the page body with terms that appear in the query set.

2.1.3 Meta tag spam
The HTML meta tags are the target of this spamming. Search engines currently give low priority to these tags, or even ignore them.

2.1.4 Anchor text spam
Spammers create HTML hyperlinks to a page with the desired anchor text. The spam terms are included not in the target page itself, but in the other pages that point to the target. The anchor text thus gets indexed for both pages, which has an impact on the ranking of both the source and the target pages.

2.1.5 URL spam
Spammers provide long URLs that include sequences of spam terms. For instance, one could encounter spam URLs like: buy-canon-rebel-20d-lens-case.camerasx.com, buy-nikon-d100-d70-lens-case.camerasx.com.

2.2 Link spamming
Spammers create link structures to increase the importance of one or more of their pages [3, 4].

Figure 3: A page having unrelated hyperlinks

Figure 4: A simple example of link farm pages [Nodes A to F denote web pages and the edges denote the links between the pages]

There are two major categories of link spam:
a) Outgoing link spam
b) Incoming link spam
Spammers create various pages that link to the spammers' target pages, such as shown in Figure 5.

Figure 5: Scheme of a typical link farm.

2.2.1 Outgoing link spam
The spammer manually adds a number of outgoing links to well-known pages, hoping to increase the page rank. The most common way of creating a vast number of outgoing links is directory cloning: spammers often simply replicate some or all of the pages of a directory, and thus create massive outgoing link structures quickly.

2.2.2 Incoming link spam
Spammers create a number of incoming links to a single target page or set of pages by using the following strategies:

2.2.2.1 Create a honey pot
A set of pages containing some useful resource may also have hidden links to the target spam page(s). It will attract people to point to it, indirectly boosting the ranking of the target page(s).

2.2.2.2 Infiltrate a web directory
Web directories allow webmasters to add links to their sites under some topic in the directory. Spammers may add to directory pages links that point to their target pages; since directory pages have both high PageRank and hub scores, this spamming technique is useful in boosting both the PageRank and authority scores of target pages.

2.2.2.3 Participate in link exchange
A group of spammers sets up a link exchange structure so that their sites point to each other.

2.2.2.4 Create own spam farm
Spammers create arbitrary link structures that boost the ranking of some target pages [8].

2.2.2.5 Buy expired domains
When a domain name expires, the URLs on various web sites that point to pages within the expired domain linger for some time. Some spammers buy expired domains and populate them with spam that takes advantage of the false relevance/importance conveyed by the pool of old links.

There are three types of pages on the Web:

Inaccessible pages: The spammer cannot modify these web pages. They are out of reach; the spammer cannot influence their outgoing links.

Accessible pages: Accessible pages are maintained by others, but can still be modified in a limited way by a spammer. For example, a spammer may be able to post a comment to a blog entry, and that comment may contain a link to a spam site.

Own pages: Own pages are maintained by the spammers, who thus have full control over their contents. We call the group of own pages a spam farm. A spammer's goal is to boost the importance of one or more pages. For simplicity, say there is a single target page t. There is a certain maintenance cost associated with a spammer's own pages, so we can assume that a spammer has a limited budget of n such pages, not including the target page [9].

2.3 Target algorithms
The two well-known algorithms used to compute importance scores based on link information are given below.

2.3.1 PageRank
The PageRank algorithm is used by the Google search engine to rank websites. PageRank uses incoming link information to assign global importance scores to all pages on the Web. It assumes that the number of incoming links to a page is related to that page's popularity among average web users [11].

2.3.2 HITS
Hyperlink-Induced Topic Search (HITS, also known as hubs and authorities) is a link analysis algorithm used to rank pages on a specific topic. It is more common, however, to use the algorithm on all pages on the Web to assign global hub and authority scores to each page. According to the circular definition of HITS, important hub pages are those that point to many important authority pages, while important authority pages are those pointed to by many hubs. A search engine that uses the HITS algorithm to rank pages returns as query results a blending of the pages with the highest hub and authority scores. Hub scores can easily be spammed by adding outgoing links to a large number of well-known, reputable pages, such as www.cnn.com or www.mit.edu. Thus, a spammer should add many outgoing links to the target page t to increase its hub score.
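A power-iteration sketch of PageRank in Python illustrates why incoming links matter; in the toy example, two boosting pages funnel score into a single target page T (the damping factor d = 0.85 is the conventional choice):

def pagerank(links, d=0.85, iters=50):
    # links maps each page to the list of pages it links to
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        nxt = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = d * pr[p] / len(outs)
                for q in outs:
                    nxt[q] += share
        pr = nxt
    return pr

# Two boosting pages pointing at a single target page T:
print(pagerank({"A": ["T"], "B": ["T"], "T": ["A"]}))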

3. HIDING TECHNIQUES
Spam pages may contain repeated terms and long lists of links. Spammers use a number of techniques to hide their abuse from regular web users visiting spam pages, or from the editors at search engine companies who try to identify spam instances [6, 7].
Figure 6: Types of hiding techniques [hiding techniques branch into content hiding, cloaking and redirection]

3.1 Content hiding
Spam terms or links on a page can be made invisible when the browser renders the page. One common technique is using appropriate color schemes: the terms in the body of an HTML document are invisible if they are displayed in the same color as the background. Color schemes can be defined either in the HTML document or in an attached cascading style sheet (CSS).

Ex.:
<body background=white>
<font color=white>hidden text</font>
...
</body>

Spam links can be hidden by avoiding anchor text. Instead, spammers often create tiny, one-pixel anchor images that are either transparent or background-colored. A spammer can also use scripts to hide some of the visual elements on the page, for instance by setting the visible HTML style attribute to false [10].
[4] S. K. Jayanthi and S. Sasikala, Perceiving Link
3.2 Cloaking
Cloaking is a search engine optimization technique in which the content presented to the search engine differs from the content presented to the user's browser. If spammers can reliably identify web crawler clients, they can use cloaking: for the same URL, spam web servers return one HTML document to a regular web browser, while they return a different document to a web crawler. This way, spammers can present the intended content to web users and, at the same time, send a spammed document to the search engine for indexing.

Web crawlers can be identified in two ways. First, some spammers maintain a list of IP addresses used by search engines and identify web crawlers by matching IPs. Second, a web server can identify the application requesting a document based on the user-agent field in the HTTP request message [10].
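As a rough illustration of the second approach, the sketch below shows how a spam server might branch on the user-agent field. This is a hypothetical Python sketch; the signature list and file names are illustrative and not taken from the papers cited above:

    # Minimal sketch of user-agent-based cloaking (hypothetical names).
    # The server inspects the User-Agent header of each HTTP request and
    # serves a keyword-stuffed page to crawlers but the intended page to
    # regular browsers, as described in Section 3.2.
    CRAWLER_SIGNATURES = ("googlebot", "bingbot", "slurp")  # assumed signatures

    def select_document(user_agent: str) -> str:
        ua = user_agent.lower()
        if any(sig in ua for sig in CRAWLER_SIGNATURES):
            return "spammed_page_for_indexing.html"   # sent to the search engine
        return "intended_content.html"                # sent to human visitors

    print(select_document("Mozilla/5.0 (compatible; Googlebot/2.1)"))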
3.3 Redirection
Another way of hiding the spam content on a page is to automatically redirect the browser to another URL as soon as the page is loaded. This way the page still gets indexed by the search engine, but the user never sees it: pages with redirection act as intermediaries for the ultimate target pages, which spammers try to serve to users reaching their sites through search engines.

4. CONCLUSION
In this paper we discussed various spamming and hiding techniques. We observed that there is an on-going battle between search engines and spammers. Due to the great impact of spam on search engines and the online community, web spam detection has become a key area of research. Spam detection using fuzzy-based approaches, the application of genetic algorithms, and the detection of cloaking are still open issues.

5. REFERENCES
[1] Z. Gyongyi, H. Garcia-Molina, and J. Pedersen, "Combating Web Spam with TrustRank," in Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), Toronto, Canada, Sept. 2004.
[2] Z. Gyongyi and H. Garcia-Molina, "Web Spam Taxonomy," in Proceedings of the First Workshop on Adversarial Information Retrieval on the Web, 2005.
[3] S. K. Jayanthi and S. Sasikala, "Link Spam Detection Based on Fuzzy C-Means Clustering," I.J. Next-Generation Networks, 198, pp. 1-10, 2010.
[4] S. K. Jayanthi and S. Sasikala, "Perceiving Link Spam Based on DBSpamClust," in Proc. IEEE International Conference on Network and Computer Science, IEEE Press, New York, 2011, pp. 31-35.
[5] L. Zhang, Y. Zhang, Y. Zhang, and X. Li, "Exploring Both Content and Link Quality for Anti-Spamming," in CIT'06, Washington, DC, USA, 2006.
[6] B. Zhou, J. Pei, and Z. Tang, "A Spamicity Approach to Web Spam Detection," in Proceedings of the SIAM International Conference on Data Mining (SDM'08), Atlanta, Georgia, April 2008.
[7] N. Spirin and J. Han, "Survey on Web Spam Detection: Principles and Algorithms," ACM SIGKDD Explorations Newsletter, vol. 13, issue 2, 2011.
[8] J. Abernethy, O. Chapelle, and C. Castillo, "Graph Regularization Methods for Web Spam Detection," Machine Learning (Springer), pp. 207-225, 2010.
[9] Y. Wang, Z. Qin, B. Tong, and J. Jin, "Link Farm Spam Detection Based on Its Properties," in Proc. International Conference on Computational Intelligence and Security, 2008, pp. 477-480.
[10] A. Benczur, D. Siklosi, J. Szabo, I. Biro, Z. Fekete, M. Kurucz, A. Pereszlenyi, S. Racz, and A. Szabo, "Web Spam: a Survey with Vision for the Archivist," in Proc. IWAW'08 International Web Archiving Workshop, 2008.
[11] "Facts about Google and Competition," archived from the original on 4 November 2011, retrieved 12 July 2014.


An Efficient Analysis on Feature Selection to Evaluate the Students' Performance Based on Various Methods

D. Jemimah Sweetlin, Research Scholar, Ayya Nadar Janaki Ammal College, Sivakasi, TamilNadu, India (jemimah.sweetlin@gmail.com)
J. Jebakumari Beulah Vasanthi, Head, Department of CS&IT, Ayya Nadar Janaki Ammal College, Sivakasi, TamilNadu, India (jebaarul07@gmail.com)
Dr. P. Raajan, Assistant Professor of MCA, Ayya Nadar Janaki Ammal College, Sivakasi, TamilNadu, India (raajanp99@gmail.com)

ABSTRACT
In the present scenario, Educational Data Mining (EDM) is one of the emerging disciplines, involving the process of analyzing students' information using different attributes such as Medium of study, First Graduate, Residence, Living Location, Family Size, Father's Qualification, Mother's Qualification, Annual Income, HSC Marks, and PSM. Based on the PSM and the other attributes, the performance in the forthcoming semester is predicted. Feature selection techniques are applied to these attributes: during feature selection, irrelevant attributes are removed and relevant attributes are selected for further use. Three feature selection methods are used and the best of them is identified.

Keywords
Educational Data Mining (EDM), Feature Selection techniques, PSM (Previous Semester Marks), Information Gain Attribute Evaluation (IGATE), Gain Ratio Based Feature Selection (GRFS).

1. INTRODUCTION
Data mining techniques are used to extract meaningful information from a large data source using various patterns and methods. Data mining has been used in many applications such as EDM, web mining, text mining, etc. EDM is an emerging discipline concerned with developing methods for exploring the unique types of data that originate from educational settings, and with using such methods to better understand the students [1]. Its methods differ from standard data mining methods. There are three main goals in EDM [2]:

Pedagogical: to help in the design of didactic contents and in the improvement of the academic performance of the students.

Managerial: to optimize the organization and maintenance of education infrastructures, areas of interest and study researches.

Commercial: to help in students' recruitment in private education.

In this paper, the feature selection techniques applied in the EDM process are described. Data mining applications and techniques can thus be used to develop the education sector.

2. LITERATURE REVIEW
Acharya et al. [3] have presented a method on educational data mining in which feature selection techniques were applied and only the relevant attributes were selected for further use. Initially 15 features were used, such as Gender, Caste, Religion, Fsize, Board, Sorigin, Income, Boardmarks, Hday, Atten, Midsem, Medium, School, Ptution, and Grade. Three feature selection techniques were applied: Correlation Based Feature Selection (CBFS), Chi-Square Based Feature Evaluation, and Information Gain Attribute Evaluation (IGATE). After applying feature selection, 8 attributes were selected for further use. They compared those feature selection techniques and visualized the results.

Ramaswami et al. [4] have developed a technique on educational data mining. The data were collected from different schools in different districts of the state of Tamil Nadu. Initially the data set had 32 features. They used six feature selection techniques on their data set to select the best attributes, including CBFS, CHFS, GRFS, IGFS, and RF. After applying feature selection, only 9 features were selected. They compared all the feature selection methods, chose the best one, and applied it.

Harb et al. [5] have developed a method to select the optimal features for the student model. Data were collected from a suburban middle school in central Massachusetts. Initially the data set had 15 features. They used the Weka tool for selecting the best attributes; Weka, an open source machine learning software, was used to conduct an extensive comparison of six classifier algorithms.

A classification technique has been described by Baradwaj et al. [6]. The main aim is to improve the students' performance. The Decision Tree classification technique was used; in their paper the ID3 algorithm was used to predict and improve the students' performance.


They collected the data from the Computer Applications department of VBS Purvanchal University, Jaunpur, for the MCA course 2007-2010. Attendance, class test, seminar, and assignment marks were collected from the student database to predict the performance at the end of the semester, which helps the teachers to improve the students' performance.

3. DATA SET DESCRIPTION
3.1 Data Collection
The data have been collected from Ayya Nadar Janaki Ammal College, Sivakasi, TamilNadu, India, by sampling the Computer Applications department (Master of Computer Applications). In this method all individual tables are combined into a single table containing all the required data.

3.2 Data Preparation
Initially the size of the data set is 47. We conducted a survey to collect the students' mark details, personal information and family background. The attributes are Name, Med, FG, Resi, LivLoc, Fsize, Fedu, Medu, Finc, HSC, and PSM. Three tables (Mark details, Family details, Personal details) were merged for this process. The data set values for some of the variables were defined for the present analysis.

Table 1. Attribute Description

Attributes | Description              | Possible Values
MED        | Medium of study          | Tamil, English
FG         | First Graduate           | Yes, No
RESI       | Residence                | Day Scholar, Hostel
LIVLOC     | Living Location          | Urban, Rural
FSIZE      | Family Size              | 2, 3, 4, >4
FEDU       | Father's Qualification   | PG, UG, HSC, SSLC, Others, Nil
MEDU       | Mother's Qualification   | PG, UG, HSC, SSLC, Others, Nil
FINC       | Family Annual Income     | High, Medium, Low
ATT        | Attendance               | Good, Average, Poor
HSCM       | Higher Secondary Marks   | Good, Average, Poor
PSM        | Previous Semester Marks  | Grade A, Grade B, Grade C, Grade U

FINC: Family Annual Income. The student's family annual income is categorized into 3 classes: High (above Rs. 100000), Medium (Rs. 30000-99999), Low (Rs. 29999 and below).

ATT: Student's Attendance. It is split into 3 classes: Good (above 95), Average (80-94), Poor (below 80).

HSCM: Higher Secondary Marks. Based on their marks, students are categorized into 3 groups: Good (above 60), Average (40-59), Poor (below 40).

PSM: Previous Semester Marks/Grade obtained in the MCA course. It is split into four class values: Grade A (above 80), Grade B (65-79), Grade C (51-64), Grade U (below 50).

All variables are categorical values. PSM is the dependent variable; all other variables are predictor variables.

4. METHODOLOGY
4.1 Feature Selection
Feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features for use in model construction [7]. Feature selection has proven, in both theory and practice, to be effective in enhancing learning efficiency, increasing predictive accuracy and reducing the complexity of learned results [8]. Feature selection in supervised learning has the main goal of finding a feature subset that produces higher classification accuracy [9]. Three filter feature subset methods were used.

4.1.1 Chi-Squared Feature Selection
One of the popular feature selection methods is chi-square (X^2). It is a statistical test that can be used to determine whether observed frequencies are significantly different from expected frequencies, and it is used to select the predictor variables [10]:

    X^2 = \sum (O - E)^2 / E        (1)

where O = observed frequency and E = expected frequency. Based on the observed and expected frequencies, the chi-square filter has been calculated. The selected attributes, described in Table 2, are MEDU, FG, FEDU, ATT, and MED. Here PSM is the dependent variable.
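As an illustrative sketch of equation (1) (a hypothetical Python illustration, not the authors' R code), the chi-square score of one categorical predictor against PSM can be computed as follows; the toy column values are invented:

    # Sketch: chi-square score of a categorical attribute vs. a class label,
    # following equation (1). Toy data; not the paper's actual data set.
    from collections import Counter

    def chi_square(xs, ys):
        n = len(xs)
        x_counts, y_counts = Counter(xs), Counter(ys)
        observed = Counter(zip(xs, ys))
        score = 0.0
        for x in x_counts:
            for y in y_counts:
                e = x_counts[x] * y_counts[y] / n   # expected frequency E
                o = observed[(x, y)]                # observed frequency O
                score += (o - e) ** 2 / e
        return score

    med = ["Tamil", "English", "Tamil", "English", "Tamil"]
    psm = ["Grade A", "Grade B", "Grade B", "Grade A", "Grade A"]
    print(chi_square(med, psm))

A higher score indicates a larger deviation from independence between the attribute and PSM, which is why high-scoring attributes are retained.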
4.1.2 Information Gain Feature Selection
Information Gain Feature Selection is another standard method. The purpose of this technique is to discard irrelevant or redundant features from a given data set. During information gain feature selection, the entropy value is calculated [11]:

    Entropy(D) = - \sum_{i=1}^{m} p_i \log_2(p_i)        (2)


The information gain of an attribute A is then:

    InformationGain_A(D) = Entropy(D) - \sum_{j=1}^{v} (|D_j| / |D|) Entropy(D_j)        (3)

where D is the data set, D_j is the j-th partition of D induced by A, and A is an attribute. The information gain is derived from the entropy. MEDU, FEDU, FINC, ATT, RESI, and FG are the attributes selected using the Information Gain Feature Selection method, as shown in Table 2.

4.1.3 Gain Ratio Feature Selection
The most popular feature selection method is the Gain Ratio Feature Selection method. The subset is selected using the entropy value as well as the information gain value; from the equation below, the gain ratio subset selector is produced. The selected attributes, shown in Table 2, are HSC, RESI, MEDU, FG, FINC, and FEDU.

    GainRatio(A) = InformationGain_A(D) / Entropy(D)        (4)

It is used to improve the performance of the classification algorithm and to find out the best feature selection method [12].
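A small sketch of equations (2)-(4) for categorical data follows (again a hypothetical Python illustration rather than the authors' R code; the attribute and label lists are invented):

    # Sketch of entropy, information gain, and gain ratio, following
    # equations (2)-(4). Toy data; column values are hypothetical.
    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def information_gain(attr, labels):
        n = len(labels)
        gain = entropy(labels)
        for value in set(attr):
            subset = [y for x, y in zip(attr, labels) if x == value]
            gain -= (len(subset) / n) * entropy(subset)  # weighted partition entropy
        return gain

    def gain_ratio(attr, labels):
        # Ratio of information gain to the entropy of the class labels,
        # mirroring equation (4) as stated in the text.
        return information_gain(attr, labels) / entropy(labels)

    medu = ["PG", "UG", "PG", "SSLC", "UG"]
    psm  = ["Grade A", "Grade B", "Grade A", "Grade C", "Grade B"]
    print(information_gain(medu, psm), gain_ratio(medu, psm))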

5. RESULT AND DISCUSSION
The data set of 47 students used in this study was obtained from the Computer Applications Department (MCA course) of Ayya Nadar Janaki Ammal College, Sivakasi, Tamilnadu. The students' mark details, personal information and family background were collected through a survey. The data set has some missing values.

The R language was used to preprocess the data. The R statistical programming language is a free open source package, widely used among statisticians and data miners for developing statistical software [13]. In pre-processing, the missing values are omitted, and then FINC, ATT, HSC, and PSM are categorized using If-Then rules. For the dependent variable (PSM), one more check was applied: in every subject, the external and internal marks were totalled, and if the internal and external marks are above 23 the candidate comes under the Pass category, or else the candidate comes under the Fail category. After removing the missing values, all the tables (Mark details, Personal information, Family background) were merged together. Now the size of the data set is 44.
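The If-Then categorization described above might look as follows in code (a hypothetical Python sketch; the thresholds are the class boundaries quoted with Table 1):

    # Sketch of the If-Then rules used to categorize FINC and PSM.
    # Thresholds follow the class definitions given with Table 1.
    def categorize_finc(income_rs: int) -> str:
        if income_rs > 100000:
            return "High"
        if income_rs >= 30000:
            return "Medium"
        return "Low"

    def categorize_psm(mark: float) -> str:
        if mark > 80:
            return "Grade A"
        if mark >= 65:
            return "Grade B"
        if mark >= 51:
            return "Grade C"
        return "Grade U"

    print(categorize_finc(45000), categorize_psm(72))  # Medium Grade B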
Table 2. Selected Attributes

Methods          | Attributes
Chi-Squared      | MEDU, RESI, FG, FEDU, ATT, MED
Information Gain | MEDU, FEDU, FINC, ATT, RESI, FG
Gain Ratio       | HSC, RESI, MEDU, FG, FINC, FEDU

The feature selection methods were applied on the data set to select the relevant attributes; the results of the three methods are described in Table 2. Using the R language, each feature selection method was applied on the data set and the relevant attributes were selected. We concentrated on selecting the best attributes. From Table 2, the Gain Ratio Feature Selection method is the best feature selection method. The selected attributes are HSC, RESI, MEDU, FG, FINC, and FEDU. The outcome of the feature selection is a rank list of predictors of the dependent variable, and the result is presented in a bar chart [14]. These 6 are the top predictor attributes which give the best result. The Gain Ratio importance is displayed in Figure 1.

Fig 1: Gain Ratio Importance (PSM is the Dependent Variable)

The time taken by each method was calculated using R. The feature selection using the Chi-squared method takes the longest time (0.20 milliseconds), the Information Gain method takes 0.14 milliseconds, and the Gain Ratio method takes the least time (0.08 milliseconds), as shown in Figure 2.

Fig 2: Time Complexity of the three Feature Selection Methods


Compared to the Chi-Squared method and the Information Gain method, Gain Ratio is the best method, and it selects the best attributes. We conducted these experiments in order to evaluate the performance and usefulness of the different feature selection methods for selecting the best attributes [14]. The result of the comparative study of the three different feature selection methods is described in Table 2.

6. CONCLUSION
In this paper, we carried out an analysis of 3 feature selection methods. According to the subset selection method, the best attributes have been selected. Based on the time complexity, we conclude that Gain Ratio is the best method for feature selection, as it takes less time compared to the Chi-squared and Information Gain methods. In future, this approach can be extended, and the method can be applied to different types of data sets which require dimensionality reduction and classification.

7. REFERENCES
[1] Baker, R. S., and Yacef, K. 2009. The state of educational data mining in 2009: A review and future visions. JEDM - Journal of Educational Data Mining, 1(1), 3-17.
[2] Barracosa, J. I. M. S. 2011. Mining Behaviors from Educational Data.
[3] Anal, A., and Devadatta, S. 2014. Early Prediction of Students using Machine Learning Techniques. IJCA - International Journal of Computer Applications, 107(1).
[4] Ramaswami, M., and Bhaskaran, R. 2009. A study on feature selection techniques in educational data mining. arXiv preprint arXiv:0912.3924.
[5] Harb, H. M., and Moustafa, M. A. 2012. Selecting Optimal Subset of Features for Student Performance Model. International Journal of Computer Science Issues (IJCSI), 9(5).
[6] Baradwaj, B. K., and Pal, S. 2012. Mining educational data to analyze students' performance. arXiv preprint arXiv:1201.3417.
[7] http://en.wikipedia.org/wiki/Feature_selection
[8] Miller, A. 2012. Subset selection in regression. CRC Press.
[9] Koller, D., and Sahami, M. 1996. Toward optimal feature selection.
[10] http://nlp.stanford.edu/IR-book/html/htmledition/chi-square-feature-selection-1.html
[11] Azhagusundari, B., and Thanamani, A. S. Feature selection based on information gain. International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN 2278-3075.
[12] Priyadarsini, R. P., Valarmathi, M. L., and Sivakumari, S. Gain ratio based feature selection method for privacy preservation.
[13] http://en.wikipedia.org/wiki/R_(programming_language)
[14] Pal, A. K., and Pal, S. 2013. Analysis and Mining of Educational Data for Predicting the Performance of Students. IJECCE, 4(5).


Improving Performance of Energy Efficient Zone Technique using Location Aided Routing Protocol for MANET

R. Sankarasubramanian, Associate Professor, Department of Computer Science, Erode Arts and Science College (Autonomous), Erode 638009 (rsankarprofessor@gmail.com)
Dr. S. Sukumaran, Associate Professor, Department of Computer Science, Erode Arts and Science College (Autonomous), Erode 638009 (prof_sukumar@yahoo.co.in)
Dr. G. Anandharaj, Associate Professor, Department of M.C.A, Ganadipathy Tulsi's Jain Engineering College, Vellore 632102 (younganand@gmail.com)

ABSTRACT
A mobile ad-hoc network consists of a set of mobile nodes which form a temporary network without using any centralized point; it consists of wireless hosts that may often move to random locations. Movement of hosts results in a change in routes. Several routing protocols have already been proposed for ad hoc networks. This paper suggests an approach that utilizes location information to improve the performance of routing protocols for ad hoc networks. By using location information, the proposed Location-Aided Routing (LAR) protocols limit the search for a new route to a smaller request zone of the ad hoc network. This results in a significant reduction in the number of routing messages. Some of the technical challenges MANET poses are also presented, and the paper points out some of the key research issues for ad hoc networking technology that are expected to promote the development and accelerate the commercial application of MANET technology in real time environments. Special attention is paid to the network layer routing strategy of MANET; key research issues include new X-cast routing algorithms, quality of service, and mechanisms for interworking with outside IP networks. The proposed solution should avoid flooding and create a new routing metric. In addition, our new routing metric is based on multi-criteria optimization in order to minimize delay and the blocking probability of data. The routing protocol observes real network parameters and real network environments.

Keywords
Mobile Communications, Wireless Networks, Ad hoc Networking, Pervasive Computing, Routing Algorithm.

1. INTRODUCTION
Mobile ad hoc networks consist of wireless mobile hosts that communicate with each other in the absence of a fixed infrastructure. Routes between two hosts in a Mobile Ad hoc NETwork (MANET) may consist of hops through other hosts in the network [1]. Host mobility can cause frequent, unpredictable topology changes. Therefore, the task of finding and maintaining routes in a MANET is nontrivial. Many protocols have been proposed for mobile ad hoc networks, with the goal of achieving efficient routing [2,3]. These algorithms differ in the approach used for searching for a new route and/or modifying a known route when hosts move. Design of routing protocols is a crucial problem in mobile ad hoc networks, and several routing algorithms have been developed. One desirable qualitative property of a routing protocol is that it should adapt to the traffic patterns [4]. Conventional routing protocols are insufficient for ad hoc networks, since the amount of routing-related traffic may waste a large portion of the wireless bandwidth, especially for protocols that use periodic updates of routing tables.

1.1 Mobile Ad Hoc Networking (MANET)
Mobile Ad hoc Networking (MANET) is a group of independent mobile network devices that are connected over various wireless links. It relatively works on a constrained bandwidth. The network topologies are dynamic and may vary from time to time. Each device must act as a router for transferring any traffic among the others. Such a network can operate by itself or be incorporated into a large area network (LAN). There are three types of MANET: Vehicular Ad hoc Networks (VANETs), Intelligent Vehicular Ad hoc Networks (InVANETs), and Internet Based Mobile Ad hoc Networks (iMANET).

The set of applications for MANETs can range from small, static networks that are limited by power sources, to large-scale, mobile, highly dynamic networks. On top of that, the design of network protocols for these types of networks faces multifaceted issues. Regardless of the application, MANETs need well-organized distributed algorithms to determine network organization, link scheduling, and routing. Conventional routing will not work in this distributed environment, because the network topology can change at any point of time. Therefore, we need sophisticated routing algorithms that take this important issue (mobile network topology) into account. While the shortest path (based on a given cost function) from a source to a destination in a static network is usually the optimal route, this idea does not easily carry over to MANETs.

Some of the factors that have become the core issues in routing include variable wireless link quality, propagation path loss, fading, interference, power consumption, and network topological changes. These conditions are aggravated in a military environment because, besides these routing issues, we also need to guarantee asset security, latency, reliability, protection against intentional jamming, and recovery from failure. Failing to abide by any of these requirements may degrade the performance and the dependability of the network.

2. RELATED WORK
Design of routing protocols is a crucial problem in mobile ad hoc networks [5], and several routing algorithms have been developed. One desirable qualitative property of a routing protocol is that it should adapt to the traffic patterns. It has been pointed out that conventional routing protocols are insufficient for ad hoc networks, since the amount of routing-related traffic may waste a large portion of the wireless bandwidth, especially for protocols that use periodic updates of routing tables.


They proposed using Dynamic Source Routing (DSR), which is based on on-demand route discovery; a number of protocol optimizations are also proposed to reduce the route discovery overhead. The AODV (Ad hoc On-demand Distance Vector routing) protocol also uses a demand-driven route establishment procedure. TORA (Temporally-Ordered Routing Algorithm) [6] is designed to minimize reaction to topological changes by localizing routing-related messages to a small set of nodes near the change. Haas and Pearlman [7] attempt to combine proactive and reactive approaches in the Zone Routing Protocol (ZRP), by initiating the route discovery phase on-demand, but limiting the scope of the proactive procedure to the initiator's local neighborhood. Recent papers present comparative performance evaluations of several routing protocols [7].

The previous MANET routing algorithms do not take into account the physical location of a destination node. In this paper, we propose two algorithms to reduce route discovery overhead using location information. Similar ideas have been applied to develop selective paging for cellular PCS (Personal Communication Service) networks [8]. In selective paging, the system pages a selected subset of cells close to the last reported location of a mobile host, which allows the location tracking cost to be decreased. We propose and evaluate an analogous approach for routing in MANET. A survey of potential applications of GPS suggests using location information in ad hoc networks, though it does not elaborate on how the information may be used. Other researchers have also suggested that location information can be used to improve (qualitatively or quantitatively) the performance of a mobile computing system, for example in a packet radio system using location information for routing purposes.
Location information: The proposed approach is termed Location-Aided Routing (LAR), as it makes use of location information to reduce routing overhead. Location information used in the LAR protocol may be provided by the Global Positioning System (GPS) [10]. With the availability of GPS, it is possible for a mobile host to know its physical location. In reality, position information provided by GPS includes some amount of error, which is the difference between the GPS-calculated coordinates and the real coordinates. For instance, the NAVSTAR Global Positioning System has a positional accuracy of about 50-100 m, and Differential GPS offers accuracies of a few meters [9].

The proposed algorithm metrics are based on two essential parameters: bandwidth and load. Two additional parameters, hop count and delay, are optional. In this way optimal usage of network resources, as well as the possibility of guaranteed QoS, can be achieved. The logic of the proposed routing protocol is presented by the block scheme in Fig. 2. The main idea is to divide the routing protocol into two phases:

(a) as fast as possible association of a new user;

(b) finding the optimal route for a new user, considering its request.

3. LOCATION-AIDED ROUTING (LAR) PROTOCOLS
3.1 Route discovery using flooding
In this paper, we explore the possibility of using location information to improve the performance of routing protocols for MANET. As an illustration, we show how a route discovery protocol based on flooding can be improved. The route discovery algorithm using flooding is described next (this algorithm is similar to DSR [8] and AODV [9]). When a node S needs to find a route to node D, node S broadcasts a route request message to all its neighbors; hereafter, node S will be referred to as the sender and node D as the destination. A node, say X, on receiving a route request message, compares the desired destination with its own identifier. If there is a match, it means that the request is for a route to itself (i.e., node X). Otherwise, node X broadcasts the request to its neighbors. To avoid redundant transmissions of route requests, a node X only broadcasts a particular route request once (repeated reception of a route request is detected using sequence numbers). Figure 1 illustrates this algorithm. In this figure, node S needs to determine a route to node D; therefore, node S broadcasts a route request to its neighbors, and each neighbor that is not the destination rebroadcasts it in turn.

Fig 1: Illustration of flooding
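To make the flooding procedure of Section 3.1 concrete, here is a minimal Python sketch of the forwarding rule at a node (a hypothetical structure, not the authors' simulator code): each node rebroadcasts a request at most once, using (sender, sequence number) pairs to detect duplicates.

    # Sketch of flooding-based route discovery at a single node.
    # A request is identified by (sender, seq_no); a node forwards a given
    # request only once, as described in Section 3.1. Names are hypothetical.
    class Node:
        def __init__(self, node_id, neighbors):
            self.node_id = node_id
            self.neighbors = neighbors        # list of neighboring Node objects
            self.seen = set()                 # (sender, seq_no) pairs already handled

        def receive_route_request(self, sender, seq_no, destination, path):
            if (sender, seq_no) in self.seen: # duplicate request: drop silently
                return
            self.seen.add((sender, seq_no))
            path = path + [self.node_id]
            if self.node_id == destination:   # request is for a route to this node
                print("Route found:", path)
                return
            for neighbor in self.neighbors:   # otherwise rebroadcast to all neighbors
                neighbor.receive_route_request(sender, seq_no, destination, path)

    a, b, c = Node("A", []), Node("B", []), Node("C", [])
    a.neighbors, b.neighbors = [b], [c]
    a.receive_route_request("A", 1, "C", []) # Route found: ['A', 'B', 'C']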


3.2 Preliminaries
The network infrastructure consists of fixed base stations whose precise location is determined using a GPS receiver at the time of installation. A geographically based routing scheme is used to deliver packets between base stations: a packet is forwarded one hop closer to its final destination by comparing the location of the packet's destination with the location of the node currently holding the packet. A routing and addressing method to integrate the concept of physical location (geographic coordinates) into the current design of the Internet has been investigated in [8]. Recently, another routing protocol using location information has been proposed [3]. This protocol, named DREAM, maintains location information of each node in routing tables and sends data messages in a direction computed based on these routing (location) tables. To maintain the location table accurately, each node periodically broadcasts a control packet containing its own coordinates, with the frequency of dissemination computed as a function of the node's mobility and the distance separating two nodes (called the distance effect). Unlike [3], we suggest using location information for route discovery, not for data delivery.

Fig 2: Flowchart for routing protocol

3.2.1 Expected zone and request zone
Expected zone: Consider a node S that needs to find a route to node D. Assume that node S knows that node D was at location L at time t0, and that the current time is t1. Then, the expected zone of node D, from the viewpoint of node S at time t1, is the region that node S expects to contain node D at time t1. Node S can determine the expected zone based on the knowledge that node D was at location L at time t0. For instance, if node S knows that node D travels with average speed v, then S may assume that the expected zone is the circular region of radius v(t1 - t0) centered at location L. If the actual speed happens to be larger than the average, then the destination may actually be outside the expected zone at time t1. Thus, the expected zone is only an estimate made by node S to determine a region that potentially contains D at time t1. In general, it is also possible to define v as the maximum speed (instead of the average) or some other measure of the speed distribution.

Fig 3: Examples of expected zone

3.2.2 Request zone
Again, consider node S that needs to determine a route to node D. The proposed LAR algorithms use flooding with one modification. Node S defines (implicitly or explicitly) a request zone for the route request. A node forwards a route request only if it belongs to the request zone (unlike the flooding algorithm in section 3.1). To increase the probability that the route request will reach node D, the request zone should include the expected zone (described above). Additionally, the request zone may also include other regions around the expected zone.

Fig 4: Request zone. An edge between two nodes means that they are neighbors

3.3 Determining membership of request zones
As noted above, our LAR algorithms are essentially identical to flooding, with the modification that a node that is not in the request zone does not forward a route request to its neighbors. Thus, implementing a LAR algorithm requires that a node be able to determine if it is in the request zone for a particular route request; the two LAR algorithms presented here differ in the manner in which this determination is made.
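As an illustrative sketch of this membership test (under the assumption, taken from the LAR literature cited as [20], that one scheme uses the smallest rectangle containing both the sender and the expected circular zone as its request zone), a node's check could look like this:

    # Sketch: request-zone membership test for a rectangular request zone.
    # Assumes the request zone is the smallest rectangle containing sender S
    # and the circular expected zone of radius r = v * (t1 - t0) around D's
    # last known location L (see Section 3.2). Names are hypothetical.
    def request_zone(sx, sy, lx, ly, r):
        # Bounding rectangle of S and the expected circle around L.
        return (min(sx, lx - r), min(sy, ly - r),
                max(sx, lx + r), max(sy, ly + r))

    def in_request_zone(x, y, zone):
        x_min, y_min, x_max, y_max = zone
        return x_min <= x <= x_max and y_min <= y <= y_max

    zone = request_zone(sx=0, sy=0, lx=400, ly=300, r=100)  # r = v*(t1 - t0)
    print(in_request_zone(250, 150, zone))  # True: node forwards the request
    print(in_request_zone(600, 500, zone))  # False: node discards the request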
4. PERFORMANCE EVALUATION
To evaluate our schemes, we performed simulations using a modified version of a network simulator, NS2 (Network Simulator 2). NS2 is a discrete-event simulator built to provide a flexible platform for the evaluation and comparison of network routing algorithms. The routing protocols simulated were flooding and the LAR scheme. We studied several cases by varying the number of nodes, the transmission range of each node, and the moving speed.

4.1 Simulation model
The number of nodes in the network was chosen to be 15, 30 and 50 for different simulation runs. The nodes in the ad hoc network are confined to a 1000 unit x 1000 unit square region. Initial locations (X and Y coordinates) of the nodes are obtained using a uniform distribution. We assume that each node moves continuously, without pausing at any location. Each node moves with an average speed v. (In the simulations presented here and in [11], the average speed of mobile nodes is used to define the expected zone for LAR scheme 1; results reported in [12] used the maximum speed instead.) Each node makes several moves during the simulation. A node does not pause between moves. During a given move, a node travels a distance d, where d is exponentially distributed with mean 20 units.

The direction of movement for a given move is chosen randomly. For each such move, for a given average speed v, the actual speed of movement is chosen from a uniform distribution. If during a move (over the chosen distance d) a node hits a wall of the 1000 x 1000 region, the node bounces and continues to move after reflection for the remaining portion of distance d. Two mobile hosts are considered disconnected if they are outside each other's transmission range. All nodes have the same transmission range. For the simulations, transmission range values of 200, 300, 400, and 500 units were used. All wireless links have the same bandwidth, 100 Kbytes per second.

As the average speed is increased, for a given simulation time, the number of moves simulated increases.


If the simulation time is kept constant, as the speed is increased, a particular configuration (for instance, a partition) that may not have occurred at a lower speed can occur at the higher speed. Therefore, we chose to vary the simulation time inversely with the average speed. (On a related note, observe that a configuration that did occur at a lower speed unavoidably lasts a shorter time when the speed is higher.) For the simulation, a sender and a destination are chosen randomly. Any data packets that cannot be delivered to the destination due to a broken route are simply dropped.

The source generates 10 data packets per second (on average), with the time between two packets being exponentially distributed. The data rate was chosen low to speed up the simulation. However, this has the impact of sending a small number of packets between two route discoveries (as compared to when the source continuously sends packets). This, in turn, results in a higher number of routing packets per data packet (defined below). When using the LAR schemes for route discovery, the sender first uses our algorithm to determine a route; if a route reply is not received within a timeout interval, the sender uses the flooding algorithm to find the route. The timeout interval is 2 s on average (specifically, the timeout interval is equal to the time required to generate 5 data packets).
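The mobility model of Section 4.1 can be sketched as follows (a hypothetical Python illustration of one move, not the NS2 code used by the authors; the reflection is approximated with small steps):

    # Sketch of one move in the mobility model of Section 4.1:
    # random direction, exponentially distributed distance (mean 20 units),
    # and reflection off the walls of the 1000 x 1000 region.
    import math
    import random

    REGION = 1000.0

    def one_move(x, y, mean_distance=20.0):
        d = random.expovariate(1.0 / mean_distance)  # move length, mean 20 units
        angle = random.uniform(0.0, 2.0 * math.pi)   # random direction
        dx, dy = math.cos(angle), math.sin(angle)
        while d > 0:
            step = min(d, 1.0)                       # advance in small steps
            x, y = x + dx * step, y + dy * step
            if not 0 <= x <= REGION:                 # bounce off a vertical wall
                x = max(0.0, min(REGION, x))
                dx = -dx
            if not 0 <= y <= REGION:                 # bounce off a horizontal wall
                y = max(0.0, min(REGION, y))
                dy = -dy
            d -= step
        return x, y

    print(one_move(500.0, 500.0))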
4.2. Simulation results
Initially, we assume that a node knows its current location accurately, without any error. At the end of this section, we briefly consider the impact of location error on the performance of our algorithms. In the following, the term "data packets" (or DP) is used to refer to the data packets received by the destination; the number of data packets received by the destination is different from the number of data packets sent by the sender, because some data packets are lost when a route is broken. The term "routing packets" (or RP) is used to refer to the routing-related packets (i.e., route request, route reply and route error) received by various nodes; the number of such packets received is different from the number of packets sent, because a single broadcast of a route request packet by some node is received by all its neighbors.
Table 1. Performance metrics for simulation

Parameters          | Value
Simulator           | GlomoSim
Protocols Studied   | AODV, DSR, LAR
Simulation Time     | 50s, 100s, 150s, 200s, 250s
Terrain Dimension   | 2000 x 2000
No. of Nodes        | 25, 50, 100, 150, 200
Node Placement      | Random
Node Mobility Model | RWP
Bandwidth           | 2 Mbps
Traffic Type        | CBR
4.2.1. Average Hop Count
Average hop count (measured as the number of hops) is the average number of links used in the route. This metric indicates how good the routing protocol is with regard to the network resources. Sometimes, routes with minimal hop count need not be optimal from the point of view of all the observed network parameters. However, as fewer links are involved in the final route, the selection is better. In our case, the proposed NN algorithm has the smallest average hop count (Figure 5), due to the fact that the NN can find the Pareto-optimal route, so that the network resources are optimally used.

Fig 5: The performance of the routing protocols - Average hop count

4.2.2. Blocking Iterations
This metric can be expressed as the number of iterations before blocking the traffic of a particular user, and is often used in optical networks or in multi-channel networks for blocking probability quantification. If we suppose that new users generate the same traffic all the time during a connection (independent of sending or receiving processes), the routing algorithm should reserve enough bandwidth on every link on the route. Depending on the routing algorithm and its ability to use the available links in an optimal way, traffic jams can be avoided to a greater or lesser extent. The neural network method again exhibits the best results regarding the optimization of network resources, as depicted in Figure 6.

Fig 6: The performance of the routing protocols - Blocking Iterations

After analyzing all the presented results, we can conclude that the proposed routing algorithm based on the artificial neural network could be efficiently used in wireless mesh networks. In more than 96% of cases in intensive simulations under different and randomly changed network and traffic conditions, the proposed protocol was better than AODV and OSPF. It must be stressed that the NN simulation was performed by a digital computer, i.e., working in sequential mode, which is unnatural for neural networks. It is well known that neural networks exhibit significantly better results in their hardware realization, when the natural parallel mode of operation is possible, producing dramatically shorter execution times; but in our research we concentrated on the architecture and organization of a new routing protocol which includes artificial intelligence in the routing process.

4.2.3. Packet Delivery Ratio
The packet delivery ratio (PDR) is defined as the ratio between the packets that are received and the number of packets sent.


This is one of the most used metrics for protocol comparison. In our simulations the proposed NN model exhibits the best performance, followed by OSPF, as depicted in Figure 7.

Fig 7: The performance of the routing protocols - Packet delivery ratio

4.2.4 Throughput
The throughput between two nodes is expressed as the number of bytes delivered per unit of time. Formally:

    Throughput = Total bytes received / Total time

The throughput (measured in bytes per second) was calculated as a function of the traffic load (expressed as the number of equal-sized packets per second) for the different routing protocols, and the results are depicted in Figure 8. The proposed NN algorithm gives significantly better results than AODV and OSPF. This is based on the fact that the proposed algorithm is tailored for finding the optimal routes, thus increasing the number of packets which reach the destination.

Fig 8: The performance of the routing protocols - Throughput

5. CONCLUSION
This paper describes how location information may be used to reduce the routing overhead in ad hoc networks. We present two location-aided routing (LAR) protocols. These protocols limit the search for a route to the so-called request zone, determined based on the expected location of the destination node at the time of route discovery. Simulation results indicate that using location information results in significantly lower routing overhead, as compared to an algorithm that does not use location information. We also suggest several optimizations of the basic LAR schemes which may improve performance. Further work is required to evaluate the efficacy of these optimizations, and also to develop other ways of using location information in ad hoc networks, for instance to improve the performance of reactive algorithms such as TORA [6], or to implement location-based multicasting. The many implementations of wireless mesh networks, and their significant advantages over competing technologies, require continual improvements from the point of view of QoS. The main part of the desired quality and network stability is the routing protocol.

The proposed routing protocol is compared with two well-known routing protocols. We used several metrics for describing the performance of the routing algorithms. It is shown that the proposed routing protocol has better or the same performance in all metrics except one, even though the neural network is simulated by a digital computer. Our protocol is scalable and it is adapted for dynamic network topology and real network environments. We believe that the performance of the proposed routing protocol can be further improved if we use multi-radio or multi-channel routing. We will explore this option in our future work. Also, hardware realization of the neural network will significantly improve the execution time for the optimization and decision processes.

6. REFERENCES
[1] I. F. Akyildiz, J. S. M. Ho and Y.-B. Lin, "Movement-based location update and selective paging for PCS networks," IEEE/ACM Transactions on Networking 4 (1996) 94-104.
[2] C. Alaettinoglu, K. Dussa-Zieger, I. Matta and A. U. Shankar, "MaRS user's manual version 1.0," Technical Report TR 91-80, The University of Maryland (June 1991).
[3] S. Basagni, I. Chlamtac, V. R. Syrotiuk and B. A. Woodward, "A distance routing effect algorithm for mobility (DREAM)," in: Proc. of ACM/IEEE MOBICOM '98 (1998).
[4] J. Broch, D. A. Maltz, D. B. Johnson, Y.-C. Hu and J. Jetcheva, "A performance comparison of multi-hop wireless ad hoc network routing protocols," in: Proc. of MOBICOM '98 (1998).
[5] S. Corson, S. Batsell and J. Macker, "Architectural considerations for mobile mesh networking (Internet draft RFC, version 2)," in: Mobile Ad-hoc Network (MANET) Working Group, IETF (1996).
[6] S. Corson and A. Ephremides, "A distributed routing algorithm for mobile wireless networks," Wireless Networks (1995) 61-81.
[7] S. Corson and J. Macker, "Mobile ad hoc networking (MANET): Routing protocol performance issues and evaluation considerations (Internet draft)," in: Mobile Ad-hoc Network (MANET) Working Group, IETF (1998).
[8] S. R. Das, R. Castaneda, J. Yan and R. Sengupta, "Comparative performance evaluation of routing protocols for mobile, ad hoc networks," in: Proc. of IEEE IC3N '98 (1998).
[9] Luo Junhai, Ye Danxia, et al., "Research on topology discovery for IPv6 networks," IEEE, SNPD 2007 3 (2007) 804-809.
[10] S. Toumpis, "Wireless ad-hoc networks," in: Vienna Sarnoff Symposium, Telecommunications Research Center, April 2004.
[11] O. Tariq, F. Greg and W. Murray, "On the effect of traffic model to the performance evaluation of multicast protocols in MANET," Canadian Conference on Electrical and Computer Engineering (2005) 404-407.
[12] X. Chen and J. Wu, "Multicasting Techniques in Mobile Ad-hoc Networks," Computer Science Department, Southwest Texas State University, San Marcos.
[13] S. K. Sarkar, T. G. Basavaraju and C. Puttamadappa, "Ad Hoc Mobile Wireless Networks: Principles, Protocols and Applications," Auerbach Publications: New York, NY, USA, 2007.
[14] R. S. Hamid, R. Ulman, A. Swami and A. Ephremides, "Wireless Mobile Ad Hoc Networks," Hindawi Publishing Corporation: New York, NY, USA, 2007.


[15] Somya Jain and Deepak Aggarwal, "Performance Evaluation of Routing Protocols for MAC Layer Models," IOSR Journal of Computer Engineering (IOSR-JCE), e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 10, Issue 4 (Mar.-Apr. 2013), pp. 71-77.
[16] Dr. S. P. Setty and B. Prasad, Professor in CS & SE, Andhra University, Visakhapatnam, "Comparative Study of Energy Aware QoS for Proactive and Reactive Routing Protocols for Mobile Ad-hoc Networks," International Journal of Computer Applications (0975-8887), Volume 31, No. 5, October 2011.
[17] Geetha Jayakumar and Gopinath Ganapathy, "Performance Comparison of Mobile Ad-hoc Network Routing Protocols," International Journal of Computer Science and Network Security (IJCSNS), Vol. 7, No. 11, pp. 77-84, November 2007.
[18] Nor Surayati Mohamed Usop, Azizol Abdullah and Ahmad Faisal Amri Abidin, "Performance Evaluation of AODV, DSDV & DSR Routing Protocol in Grid Environment," IJCSNS International Journal of Computer Science and Network Security, Vol. 9, No. 7 (July 2009).
[19] Natrajan Meghanathan, "A Location Prediction-Based Reactive Routing Protocol to Minimize the Number of Route Discoveries and Hop Count per Path in Mobile Ad-hoc Networks," Department of Computer Science, Jackson State University, Jackson, MS 39217, USA (2010).
[20] Ko, Young-Bae and Vaidya, N. H., "Location-Aided Routing (LAR) in mobile ad hoc networks," Wireless Networks, vol. 6, pp. 307-321, 2000.


Problems Encountered in Securing Mobile Database and their Solutions

D. Roselin Selvarani, Department of Computer Science, Holy Cross College (Autonomous), Bharathidasan University, Tamil Nadu, India (drsrani09@gmail.com)
Dr. T. N. Ravi, Department of Computer Science, Periyar E.V.R. College, Bharathidasan University, Tamil Nadu, India (proftnravi@gmail.com)

ABSTRACT
Today many organizations encourage employee mobility to conduct business, rather than working inside their physical boundaries, in order to increase their revenue. This is made possible by the fast development of lightweight devices and wireless technologies. When an employee moves, the mobile device, as well as the database stored in the mobile device (the mobile database), also moves with him. This is highly critical because of the sensitivity and significance of the business data stored in the device. Therefore the security of the mobile database is an important issue to be considered. The main objective of database security is the protection of data against accidental or intentional loss. The paper aims to analyze the problems encountered in securing mobile databases and to provide possible solutions to overcome these problems. It also presents some suggestions for the organizations where the concept of mobile database is implemented.

Keywords - Mobile database, Security, Problems, Solutions, Architecture, Suggestions

1. INTRODUCTION
Security is an important issue to be considered in the mobile computing environment. The popularity of mobile computing is increasing day by day because of the fast development of lightweight devices and wireless technologies. Due to these latest developments, the concept of mobility of the database also comes into the picture. This database technology permits employees using mobile devices to connect to their corporate networks, hoard the needed data, work in disconnected mode, and reconnect to the network to synchronize with the corporate database. In this scenario, the data is moved closer to the applications in order to improve performance and autonomy. Today, in many corporations, business applications are going mobile and are using the business data of an enterprise in a mobile context in order to improve revenue by increasing productivity. So there is a need to secure the sensitive data stored in the device. When the database resides inside a mobile device such as a laptop, Personal Digital Assistant or smart phone, it is called a mobile database [1]. In order to secure the mobile database, the following areas are to be secured: the security of the mobile device, the security of the database that is stored in the mobile device, and the security of the network in which the data is transmitted. The issues of mobile device security, such as physical issues, logical issues, network issues and personnel issues, and their solutions and recommendations, are already presented in paper [2]. The objective of this paper is to provide the problems and solutions for mobile database security.

The rest of this paper is organized as follows: Section II provides a review of the literature on mobile database security. Section III discusses the basic concepts such as the architecture, advantages and challenges of mobile databases. Section IV presents mobile database security under various headings. Section V gives suggestions for the organizations implementing the mobile database, and finally Section VI concludes the paper.

2. REVIEW OF LITERATURE
In [3], the authors presented security issues of mobile database systems as well as mobile networks and discussed the solutions for them. They classified the security issues into four different areas: security of the mobile device, security of the operating system on the mobile device, security of the mobile database, and security of the mobile network. They also identified a set of vulnerabilities of mobile databases and provided some techniques to decrease the side effects of these vulnerabilities. In [4], the authors focused on the concepts of mobile database security. They briefly summarized the basic requirements of mobile database security, mobile device security and mobile network security. They also provided problems, security challenges and solutions for mobile distributed databases. In [5], the authors presented the basic concepts of mobile databases and their issues and solutions; security related strategies and techniques were also provided. In [6], the authors presented the overall security of the mobile database application and explained that it is achieved through securing four different areas: security of the mobile device, the central computer, the communication link, and application-specific issues. They discussed implementing encryption inside the database management system as well as outside the database. They also provided the security implications of Wireless LANs (WLANs) for mobile applications, and provided security tools and solutions. They also made a comparative study among the various wireless security protocols. In [7], the authors analyzed the security threats and solutions for various mobile devices and compared Android and iOS. They tried to identify threats and dealt with the subject of security in four fields, namely security of the mobile device, security of the operating system on the mobile device, security of the mobile database, and security of the mobile network. They also briefly explained the threats and the possible solutions.


3. BASIC CONCEPTS
3.1. The Architecture of Mobile Database
In [8], Vijay Kumar presents a reference architecture (Fig. 1) for a mobile database system. In this architecture, Fixed Hosts (FH) and Base Stations (BS) are interconnected through a high speed wired network. One or more BSs are connected with a Base Station Controller (BSC), which coordinates the operation of the BSs when commanded by the Mobile Switching Centre (MSC). BSs are incorporated with some simple data processing capability so that they can coordinate with database servers (DBS). Unlimited mobility in PCS (Personal Communication Service) and GSM (Global System for Mobile Communications) is supported by the wireless link between the BS and the mobile units. To utilize full database functionality, it is necessary to incorporate the DBS into the PCS or GSM network. Each DBS can be reached by any BS or FH. The set of MSC and PSTN (Public Switched Telephone Network) connects the MDS to the outside world.

Fig 1: Reference Architecture

Apart from the reference architecture for mobile databases, there are three types of architectures discussed in the literature, namely Client/Server architecture, Server/Agent/Client architecture and Peer-to-Peer architecture [9]. The Client-Server architecture consists of a server, fixed hosts and mobile clients. In the Server/Agent/Client architecture, an agent is included either on the server or on the client side. An agent is a computer program that can achieve a series of goals of its designers and users, and can operate freely and autonomously in the mobile environment. In the Peer-to-Peer architecture, clients may also communicate with other clients to share their data.

3.2. Advantages
- Offline access to data: the user can read and update the data without a network connection
- Overcomes problems like dropped connections, low bandwidth and high latency, which are common in wireless networks
- Increases the battery life of the device, by not using the modem all the time
- Reduces application cost by reducing the length of network connection time
- Wireless airtime fees can be reduced by synchronizing only the updated data

3.3. Challenges
Mobile database challenges and limitations are inherited from the mobile environment:
- Resource scarcity of the mobile device
- Limited wireless bandwidth: the downstream direction exceeds the upstream (client to server) direction
- Frequent disconnection: not always connected, intermittent connectivity, roaming, deliberately switched off
- Not very secure

4. THE SECURITY OF MOBILE DATABASE
The fundamental requirements of database security are confidentiality, integrity, authentication, authorization and non-repudiation. Along with these fundamental requirements, the mobile database needs additional security requirements due to the mobility of the users, the portability of the mobile devices and the wireless connectivity. It may also encounter an array of security problems from mobile users, hackers and viruses. In order to ensure the security of the mobile database, a proper authentication mechanism, a suitable access control scheme and a strong encryption technique must be implemented. In addition, audit and recovery procedures must be incorporated. Encryption plays a major role in maintaining the confidentiality of the data.

4.1. Confidentiality
In many corporate sectors, businesses are conducted not only within the confines of their physical boundary but also outside their buildings. They have employees, such as mobile workers, who travel and work at different geographical areas to conduct transactions. In order to conduct transactions, they need corporate sensitive data with them. So the organizations permit the mobile workers to carry their business data in their mobile devices so that the productivity and the revenue can be increased. While hoarding the data in their device, the data should be properly protected; otherwise it will be exposed to outsiders, including their competitors. Therefore, the confidentiality of the database that resides in the mobile device is the major concern for such organizations.

The various encryption techniques used for securing mobile databases are well explained in paper [10]. The necessity for encryption is strongly suggested in [11], to diminish the risk of intentional or accidental disclosure of sensitive data in portable devices. Symmetric (or private key) encryption and asymmetric (or public key) encryption are the two broader classes of cryptographic algorithms. Compared to private key encryption, public key encryption is highly computationally intensive. Therefore, for resource-constrained mobile devices, symmetric key algorithms are more suitable [12]. A new technique, C-SDA (Chip-Secured Data Access), is recommended in paper [13]. It is a client-based security component acting as an incorruptible mediator between a mobile client and an encrypted database. This component is embedded into a smart card to prevent any alteration from occurring on the client side. Lightweight encryption algorithms such as Random Number Addressing Cryptography [14] and the Scalable Encryption Algorithm [15] are also found in the literature for lightweight devices. These two algorithms are co-designed using both hardware and software.
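As a minimal illustration of symmetric encryption of a local record before it is stored in the mobile database (a sketch using Python's third-party "cryptography" package, which is our illustrative choice here and not one of the algorithms cited above; key management is deliberately simplified):

    # Sketch: symmetric (private key) encryption of a sensitive field before
    # storing it in the local mobile database. Requires the "cryptography"
    # package (pip install cryptography); names and data are hypothetical.
    from cryptography.fernet import Fernet

    key = Fernet.generate_key()     # in practice: derive/store in secure storage
    cipher = Fernet(key)

    record = b"customer=ACME; credit_limit=500000"
    token = cipher.encrypt(record)  # ciphertext stored in the mobile database
    print(cipher.decrypt(token))    # plaintext recovered only with the key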


4.2. Authentication
Authentication determines and verifies the identity of a user in the system. Mobile users must authenticate themselves whenever they log on. A pass pattern, password or PIN can be set to prevent unauthorized users from accessing the data. Because passwords can be easily guessed, many biometric authentication methods such as fingerprint, voice recognition, iris recognition and facial recognition have recently been suggested; the problems with biometric methods are cost and unreliability. Authentication is also needed when data is transmitted between the mobile client and the server. Generally, authentication is based on one of three factors: what we know (e.g., a password or PIN), what we have (e.g., a hardware token) and what we are (e.g., a biometric). Using a strong and complicated username and password makes a mobile device considerably harder to compromise, and the stored credential itself should be protected, as sketched below.
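As a concrete illustration of the "what we know" factor, the following hedged sketch (not from the cited works) stores and verifies a PIN/password as a salted, slow hash using only the Python standard library:

# Illustrative sketch: storing and checking a password as a salted,
# slow hash (PBKDF2-HMAC-SHA256) instead of in plain text.
import hashlib, hmac, os

def hash_password(password, salt=None):
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)
    return salt, digest

def verify_password(password, salt, stored):
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, stored)   # constant-time comparison

salt, stored = hash_password("S3cret-PIN")
print(verify_password("S3cret-PIN", salt, stored))   # True
print(verify_password("guess", salt, stored))        # False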
4.3. Integrity
Data integrity is protected through access control, i.e., by limiting who can alter the data. In the client-server architecture, the access rules are enforced in the central server: when a mobile worker wants to access the data, the server checks the access control rules and then permits the access.

4.4. Device Corruption
If the device is corrupted, the database stored inside it is also destroyed. Therefore, a proper backup procedure should be implemented to ensure that the database can quickly be put back into use.

4.5. Device Loss
If the device is lost or stolen, the confidentiality of the stored data is lost. If the device is found after a period of time, its integrity may also have been lost: spyware may have been installed, or a physical bug added to the hardware, leading to a tampered system. Although this threat is common to any device, mobile devices are more likely to be lost because they are small and constantly moving along with their users. Once a device is lost, everything stored inside it is lost as well. Encryption and remote wiping are the possible solutions to this problem.

4.6. Interception of Data through WiFi and Bluetooth
When WiFi or Bluetooth is enabled, any person on the same network can hack the device and download all the data available on it. Therefore, these facilities should be disabled whenever they are not needed.

4.7. Unauthorized Access and Modification of Data during Transmission
Whenever the client sends data over a communication medium, there is a possibility of the data being not only accessed but also modified. Therefore, care should be taken to encrypt the data in transit and to enforce a two-way authentication method.

4.8. Problems due to Malicious Software
Malware is software that is often masqueraded as a game, a patch or another useful third-party application. It passes into the mobile device as a Trojan which appears to provide some functionality but contains a malicious program. Keystroke logging is another type of malware that records keystrokes on the mobile device; using these keystrokes, it captures sensitive information and sends it to a cybercriminal's website or e-mail address. Malware also includes viruses, spyware, etc. Once installed, it can initiate an array of attacks and multiply itself onto other devices. Malicious applications can retrieve sensitive information, gain control over the user's browsing history, initiate telephone calls, secretly activate the device's microphone or camera to record information, and download other malicious applications.

A virus is a program that replicates itself and infects the mobile device without the user's knowledge. Initially it infects one mobile device, then slowly spreads to other devices and finally to the server during the synchronization process. Security techniques configured only for detecting external attacks can easily be bypassed by such viruses. One of the worst viruses targets mobile phones and makes the infected phone unusable by locking it up completely. Most viruses enter devices through a corrupted e-mail attachment or a phishing website. Examples: Dust, Lasco, Cardblock.

A Trojan horse is a program that embeds itself within an apparently harmless or trusted application. It depends on the action of the user to succeed, and requires successful use of social engineering rather than the ability to exploit flaws in the security design or configuration of the target.

A worm is a program that replicates itself to spread across networks. It can potentially overwhelm mobile devices and fixed computer systems, and it does not need to be part of another application in order to spread. Examples: Cabir, CommWarrior, Feak.

Spyware is a program that is secretly installed to log and report user activities and personal data. Example: FlexiSpy.

4.9. Problems due to Employees
This is a non-technical attack. Many security breaches occur due to a lack of awareness of security policies: even when a corporation has standard policies for mobile device security, employees do not understand the associated risks. In [16], it is found that careless employees pose a greater security risk (72%) than hackers (28%), which reinforces the importance of implementing a strong combination of technology and security awareness throughout the organization.

Bring Your Own Device (BYOD) is a recent trend. Generally, organizations provide mobile devices to mobile workers for conducting business outside the boundaries of the organization, but nowadays they ask the workers to bring their own devices. This may also lead to security breaches, as the organization may not be able to control the devices. The trend toward supporting corporate applications on employees' own notebooks and smartphones is already under way in many organizations and will become commonplace within four years [17]. Employees are also willing to use private consumer smartphones or notebooks for business rather than the organization's devices. When they use their own devices, they must be loyal to the corporation or organization where they work. Leakage of sensitive corporate data is a crime, and

a person found to be involved in such activities should be penalized. Implementing strong security policies, installing monitoring software and educating the employees are the possible solutions to these issues.

Table 1. Problems and Appropriate Solutions for Mobile Database Security

Problem | Solution(s)
1. Unauthorized disclosure of the data | Lightweight encryption algorithm; strong encryption algorithm
2. Unauthorized access of the data | Password / PIN / token / biometric factors such as fingerprint, iris recognition, voice recognition, etc.
3. Unauthorized modification of the data | Grant / Revoke commands
4. Destruction of sensitive data due to device corruption | Backup / recovery procedure
5. Unauthorized disclosure of data due to device loss/theft | PIN / password / passcode / pass pattern; remote wiping
6. Interception of data through WiFi, Bluetooth, etc. | Turning off Bluetooth and WiFi facilities
7. Unauthorized access and modification of the data during transmission | Encryption; server and client authentication; WPKI certificate
8. Problems due to malicious software such as viruses or Trojan horses | Anti-virus software; firewall
9. Problems due to employees / insiders (BYOD) | Establishing mobile device policies and enforcing penalties on workers who disobey them

5. SUGGESTIONS FOR ORGANIZATIONS
1. Ensure that data stored in mobile devices is encrypted using a lightweight encryption algorithm.
2. Ensure that all data stored on removable drives such as USB flash drives and memory cards is also encrypted properly, so that it is not revealed to an unknown person in case of loss.
3. Ensure that a proper authentication mechanism is incorporated to determine and verify the user's identity.
4. Ensure that appropriate access control mechanisms are established to protect data integrity by restricting who can alter data.
5. Ensure that periodic backups of mobile devices are taken, so that a device can quickly be put back into use whenever any destruction occurs.
6. Ensure that the mobile device has a lockout facility, so that when an unauthorized person enters a PIN/password/passcode unsuccessfully more than 5 times, the device is automatically locked out.
7. Whenever a device is lost or stolen, inform the organization immediately, so that the data stored inside the device can be deleted through remote wiping.
8. Ensure that Bluetooth- and WiFi-enabled mobile devices are turned off when not in use.
9. Ensure that the network path between the mobile client and the server is secured.
10. Ensure that anti-virus software and firewalls are installed on the mobile devices and updated regularly.
11. Ensure that mobile device policies are established in the organization and that users are informed about the importance of the policies and the means of protecting their information.
12. Ensure that the mobile device uses a pattern to lock out the device.

6. CONCLUSION
A study by Gartner found that 45 percent of workers in the United States work outside the traditional office at least eight hours per week, and that by 2014, 90 percent of organizations would support corporate applications on personal devices [18]. Therefore, there is a need to analyze the problems and identify the appropriate solutions for securing mobile databases. It is also important to continue to explore the problems faced by database systems in the mobile environment.

7. REFERENCES
[1] Ouri Wolfson, "Mobile Database", Encyclopedia of Database Systems, Part 13, p. 1751, 2009. http://www.springerlink.com/content/n72wu51n4056524g/fulltext.html
[2] D. Roselin Selvarani and T. N. Ravi, "Issues, Solutions and Recommendations for Mobile Device Security", International Journal of Innovative Research in Technology and Science (IJIRTS), Vol. 1, No. 5, pp. 9-14, November 2013, ISSN 2321-1156. http://ijirts.org/volume1issue5/IJIRTSV1I5027.pdf
[3] Parviz Ghorbanzadeh, Aytak Shaddeli, Roghieh Malekzadeh and Zoleikha Jahanbakhsh, "A Survey of Mobile Database Security Threats and Solutions for It", 3rd International Conference on Information Sciences and Interaction Sciences (ICIS), pp. 676-682, 2010.
[4] Tao Zhang and Shi Xing-jun, "Status Quo and Prospect on Mobile Database Security", Telkomnika, Vol. 11, No. 9, pp. 4949-4955, September 2013.
[5] A. R. Bhagat and V. B. Bhagat, "Mobile Database Review and Security Aspects", International Journal of Computer Science and Mobile Computing, Vol. 3, Issue 3, pp. 1174-1182, March 2014.
[6] Sanket Dash and Malaya Jena, "In the Annals of Mobile Database Security", CPMR International Journal of Technology, Vol. 1, No. 1, December 2011.
[7] Hezal Lopes and Rahul Lopes, "Comparative Analysis of Mobile Security Threats and Solution", International Journal of Engineering Research and Applications, Vol. 3, Issue 5, pp. 499-502, Sep-Oct 2013.

[8] Vijay Kumar, Mobile Database Systems, Wiley-Interscience, John Wiley & Sons, Inc., New Jersey, 2006.
[9] D. Roselin Selvarani and T. N. Ravi, "A Survey on Data and Transaction Management in Mobile Databases", International Journal of Database Management Systems (IJDMS), Vol. 4, No. 5, pp. 1-20, October 2012, ISSN 0975-5705 / 0975-5985. DOI: 10.5121/ijdms.2012.4501.
[10] D. Roselin Selvarani and T. N. Ravi, "A Review on the Role of Encryption in Mobile Database Security", International Journal of Application or Innovation in Engineering & Management (IJAIEM), Vol. 3, Issue 12, pp. 76-83, December 2014, ISSN 2319-4847.
[11] Tao Zhang and Shi Xing-jun, "Status Quo and Prospect on Mobile Database Security", Telkomnika, Vol. 11, No. 9, pp. 4949-4955, September 2013.
[12] Faith M. Heikkila, "Encryption: Security Considerations for Portable Media Devices", IEEE Security and Privacy, 2007.
[13] Luc Bouganim and Philippe Pucheral, "Chip-Secured Data Access: Confidential Data on Untrusted Servers", Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002.
[14] Masa-aki Fukase and Tomoaki Sato, "Innovative Ubiquitous Cryptography and Sophisticated Implementation", ISCIT 2006, IEEE, pp. 364-369, 2006.
[15] B. Praveen Kumar, P. Ezhumalai, P. Ramesh, S. Sankara Gomathi and P. Sakthivel, "Improving the Performance of a Scalable Encryption Algorithm (SEA) using FPGA", International Journal of Computer Science and Network Security, Vol. 10, No. 2, February 2010.
[16] "The Impact of Mobile Devices on Information Security: A Survey of IT Professionals", Dimensional Research, January 2012. www.dimensionalresearch.com
[17] http://pewinternet.org/Reports/2011/Smartphones/Summary.aspx
[18] https://www.usa.canon.com/CUSA/assets/app/pdf/ISG_Security/CanonQuickPulse_ITWorld_Whitepaper.pdf

An Enhanced Encryption Technique for Wireless Sensor Networks

J. Daniel Mano*, Ph.D Research Scholar, Department of Computer Science, Erode Arts and Science College (Autonomous), Erode, Tamil Nadu, India. danie2516@gmail.com
Dr. S. Sathappan, Associate Professor, Department of Computer Science, Erode Arts and Science College (Autonomous), Erode, Tamil Nadu, India. devisathappan@yahoo.co.in

ABSTRACT
A Wireless Sensor Network is a wireless network of many autonomous, small sensor nodes that are self-organized and use sensors to monitor the real-world environment. The development of wireless sensor networks was motivated by military applications such as battlefield surveillance; today such networks are used in many industrial and consumer applications. Many encryption algorithms are widely available and used in information security; they can be categorized into symmetric (private-key) and asymmetric (public-key) encryption. This research work investigates different cryptographic techniques, namely symmetric-key and asymmetric-key cryptography, and proposes an efficient technique for wireless sensor networks. In addition, the proposed technique is compared with different encryption techniques on metrics such as key size, packet size, bandwidth and throughput. The proposed system focuses on self-organized key distribution and secured message transmission for wireless sensor networks.
Keywords: WSN, Cryptography, NS2.

1. INTRODUCTION
Wireless Sensor Network
A wireless sensor network is a collection of nodes organized into a cooperative network. A WSN can be defined as a network of devices that can sense the environment and communicate the information gathered from the monitored field through wireless links [2][3]. The data is forwarded, possibly via multiple hops, to a sink (sometimes denoted controller or monitor) that can use it locally or is connected to other networks (e.g., the Internet) through a gateway. The nodes can be stationary or moving; they can be aware of their location or not; and they can be homogeneous or not.

Figure 1: Wireless Sensor Networks

MAC Protocols for WSN
Medium Access Control protocols designed for wireless LANs have been optimized for maximum throughput and minimum delay, while low energy consumption has been left as a secondary requirement. In WSNs, energy efficiency is the main task, and there are large opportunities for energy savings at the MAC layer. In the literature, four sources of energy waste have been identified: collisions, control packet overhead, listening to a transmission destined for someone else (overhearing), and idle listening. The most important source of energy savings in a sensor network is avoiding idle listening; one way to avoid idle listening is to use a TDMA protocol.

1.1 Security Issues for Wireless Sensor Networks
A) Limited Resources
All security approaches require a certain amount of resources for their implementation, including data memory, code space and energy to power the sensor. However, these resources are currently very limited in a tiny wireless sensor.

B) Limited Memory and Storage Space
A sensor is a tiny device with only a small amount of memory and storage space for code. In order to build an effective security mechanism, it is necessary to limit the code size of the security algorithm [1].

1.2 Cryptography Techniques of Wireless Sensor Network Design
Security is the combination of processes, procedures and systems used to ensure confidentiality, authentication, integrity, availability, access control and non-repudiation.

Confidentiality:
The goal of confidentiality is to keep transmitted information from being read by unauthorized users or nodes. WSNs use an open medium, so usually all nodes within direct transmission range can obtain the data; one way to keep information confidential is to encrypt it. In WSNs, confidentiality protects information from disclosure during communication between one sensor node and another, or between the sensors and the base station. Compromised nodes may be a threat to confidentiality if the cryptographic keys are stored unencrypted in the node.

Authentication:
The goal of authentication is to be able to identify a node or a user and to prevent impersonation. In wired networks and infrastructure-based wireless networks, it is possible to implement a central authority at a router, base

station, or access point. However, there is no central authority in WSNs, and it is much more difficult to authenticate an entity.

Integrity:
The goal of integrity is to keep a message from being illegally altered or destroyed during transmission. When data is sent through the wireless medium, it can be modified or deleted by malicious attackers; attackers can also resend it, an action known as a replay attack. Integrity can be achieved through hash functions.

Non-repudiation:
The goal of non-repudiation is that if an entity sends a message, the entity cannot deny having sent it. By producing a signature for the message, the entity cannot later deny that message. In public-key cryptography, a node A signs the message using its private key; all other nodes can verify the signed message using A's public key, and A cannot deny that its signature is attached to the message.

Availability:
The goal of availability is to keep the network service or resources available to legitimate users. It ensures the survivability of the network despite malicious incidents. In a WSN, examples of risks to availability are sensor node capture and denial-of-service attacks. One solution could be to provide alternative routes in the protocols employed by the WSN to mitigate the effect of outages.

Access control:
The goal of access control is to prevent unauthorized use of network services and system resources. Obviously, access control is tied to authentication attributes. In general, access control is the most commonly needed service in both network communications and individual computer systems [4].

2. LITERATURE SURVEY
F. Amin and A. H. Jahangir et al. focused on the many security solutions that have been proposed in the domain of wireless sensor networks. Their survey was mostly focused on techniques in which wireless sensor network nodes perform public-key cryptography operations; they concluded that asymmetric cryptography is not well suited for wireless sensor networks [5].

Nidhi Singhal and J. P. S. Raina presented a comparison of the AES and RC4 algorithms. The aim of their paper was to study different modes of operation, different settings, memory utilization, throughput and encryption time; the experiments showed that RC4 was more energy-efficient for encryption and decryption [6].

Mansoor Ebrahim et al. presented a comparative analysis of existing symmetric cryptographic algorithms based on their architecture, scalability, flexibility, reliability, security and limitations, properties that are essential for secure communication (wired or wireless). The symmetric block encryption algorithms were compared on the basis of different parameters, including authentication, flexibility, reliability, robustness, scalability and security; the authors concluded that AES (Rijndael) was the best among all in terms of security, flexibility, memory usage and encryption performance [7].

Aman Kumar et al. presented a comparative analysis between DES and RSA. The performance of the algorithms differed according to data load; the authors concluded that the throughput of DES was much better than that of RSA [8].

Hyo-Won Kim et al. proposed an efficient DSP-based method for optimizing a 128-bit AES block cipher in OCB mode for high performance. Their simulation results showed encryption/decryption throughput of 401 Mbps and 406 Mbps respectively at a 1 GHz clock speed, corresponding to a 50% speed-up over a general implementation with 3.5% more memory usage [9].

3. PROPOSED METHODOLOGY
A Wireless Sensor Network (WSN) is a self-organizing system of sensor nodes, and the nodes are free to move randomly. The nature of wireless sensor networks makes them very vulnerable to an opponent's security threats, so providing security through cryptographic algorithms in these networks is very important. Among all cryptographic algorithms, symmetric encryption plays the main role in providing information security in WSNs. Hence, this research presents an Enhanced Encryption Standard (EES) algorithm that targets lower energy consumption through limited computation, by reducing the number of rounds and decreasing the key size. It is an enhancement of AES (Advanced Encryption Standard) that considers the limitations in CPU, memory and battery utilization of wireless sensor nodes. The symmetric key is shared in a more secure way using the Diffie-Hellman key exchange protocol (see the sketch after the algorithm outline below).

EES ALGORITHM:
//EES Encryption//
Step 1: The given input is a 120-bit plain text, which is divided into two halves of 60 bits each.
Step 2: Choose a variable key length of 120, 170 or 200 bits (default 200).
Step 3: Encrypt data blocks of 120 bits in 10, 12 or 14 rounds, depending on the key size.
Step 4: The S-Box stage applies the Shift Rows and Mix Columns processes to each bit.
Step 5: The Add Round step after the swapping and final permutation procedure yields the 60-bit cipher-text halves.
//EES Decryption//
//*Decryption is the reverse of the encryption process*//
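The EES internals are only outlined above, but the Diffie-Hellman exchange that the scheme relies on for key sharing can be illustrated concretely. The following is a minimal, illustrative Python sketch of textbook Diffie-Hellman between two sensor nodes; the parameters are toy values, not those of the proposed system:

# Illustrative sketch of Diffie-Hellman key agreement, as used to share
# the symmetric key between two sensor nodes. Toy parameters: a real
# deployment would use a standardized large prime group (e.g., RFC 3526).
import secrets

p = 0xFFFFFFFFFFFFFFC5   # largest 64-bit prime (illustrative only)
g = 5                    # public generator

a = secrets.randbelow(p - 2) + 1    # node A's private value
b = secrets.randbelow(p - 2) + 1    # node B's private value

A = pow(g, a, p)   # node A transmits g^a mod p
B = pow(g, b, p)   # node B transmits g^b mod p

k_A = pow(B, a, p)   # A computes (g^b)^a mod p
k_B = pow(A, b, p)   # B computes (g^a)^b mod p
assert k_A == k_B    # both nodes now share the same symmetric key material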

4. PERFORMANCE EVALUATION
Simulator Scenario
The performance analysis is carried out with different parameters using Network Simulator 2. Our proposed algorithm is compared with RSA, RC4 and AES on throughput, bandwidth, key size and different packet sizes. Table 4.1 shows the parameters used in our scenario.

Table 4.1 - Parameters used in our Scenario
PARAMETER | VALUE
Area of simulation | (500 * 500) m
Number of nodes | 50
Routing protocol | AODV
Internet protocol type | TCP
Antenna model | Omnidirectional
MAC type | 802.11
Security algorithm | Enhanced Encryption Standard
Number of packets | 15

Performance metrics are calculated based on the following tasks:
1. Compare the key sizes of RSA, RC4, AES and the proposed EES algorithm.
2. Calculate the encryption and decryption times of RSA, RC4, AES and the proposed EES algorithm for different packet sizes.
3. Compute the throughput of RSA, RC4, AES and the proposed EES algorithm in milliseconds.
4. Compute the bandwidth of RSA, RC4, AES and the proposed EES algorithm.

(i). Comparison of key size
The comparison of key sizes between the proposed approach and the existing approaches is shown in Table 4.2. The proposed approach uses a much smaller key size than RSA, RC4 and AES.

Table 4.2: Comparison based on key size (bits)
RSA | RC4 | AES | Proposed EES
512 | 420 | 128 | 120
1024 | 1000 | 192 | 170
1536 | 1432 | 256 | 200

(ii). Encryption / decryption time for different packet sizes
The comparison of encryption times for the proposed approach is shown in Table 4.3. For each packet size, our proposed approach takes much less time than the existing approaches RSA, RC4 and AES.

Table 4.3 - Encryption time for different packet sizes
Algorithm | Packet size | Time (ms)
RSA | 1 | 3
RSA | 5 | 6
RSA | 15 | 9
RC4 | 1 | 3
RC4 | 5 | 7
RC4 | 15 | 10
AES | 1 | 0.24
AES | 5 | 0.18
AES | 15 | 0.06
Proposed EES | 1 | 0.11
Proposed EES | 5 | 0.12
Proposed EES | 15 | 0.07

Figure 2: Encryption / decryption time for different packet sizes

(iii). Throughput of RSA, RC4, AES and the proposed EES algorithm (in milliseconds)
The throughput comparison of the proposed approach is presented in Figure 3. It is observed that the proposed EES approach shows an improvement in throughput when compared to the other three approaches. A rough sketch of how such per-packet timings can be measured follows.
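The measurements above come from NS2. As a hedged illustration only of how per-packet encryption time of a cipher can be benchmarked outside the simulator, the following sketch times AES-CTR from the Python "cryptography" package over different payload sizes; the payload sizes and iteration count are arbitrary choices, not the authors' setup:

# Illustrative benchmark of encryption time versus payload size.
# Assumes: pip install cryptography. Not the authors' NS2 methodology.
import os, time
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

key, nonce = os.urandom(16), os.urandom(16)

for size_kb in (1, 5, 15):                      # hypothetical payload sizes in KB
    payload = os.urandom(size_kb * 1024)
    start = time.perf_counter()
    for _ in range(100):                        # average over 100 runs
        enc = Cipher(algorithms.AES(key), modes.CTR(nonce)).encryptor()
        enc.update(payload); enc.finalize()
    elapsed_ms = (time.perf_counter() - start) * 1000 / 100
    print(f"{size_kb} KB: {elapsed_ms:.3f} ms per encryption")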

Figure 3: Throughput for RSA, RC4, AES and EES algorithms

(iv). Bandwidth of AES, RC4, RSA and the proposed EES algorithm
The bandwidth comparison of the proposed approach is shown in Figure 4. The highest bandwidth is obtained by our approach; the bandwidth obtained by the other approaches is much lower. Thus, the proposed approach outperforms the other three approaches in terms of bandwidth.

Figure 4: Bandwidth for AES, RC4, RSA and the proposed EES algorithm

5. CONCLUSION & SCOPE FOR FUTURE WORK
CONCLUSION
An efficient and more secure data transformation technique is required in areas like military, remote environmental monitoring and target tracking. The present research work proposes a novel encryption technique for wireless sensor networks. The performance of the proposed approach is evaluated on the basis of parametric standards like throughput, bandwidth, different key sizes and encryption/decryption time, and is compared to the standard approaches AES, RSA and RC4. It is clearly observed that the experimental results of the proposed approach outperform the standard approaches.

A future enhancement of this research work would be to increase the throughput further with less time. Currently our scheme works in a one-level clustered network environment, i.e., aggregation one hop from the base station; in real deployments this is usually not the case, so our future work is directed toward multilevel cluster environments. In addition, the optimal lifetime of batteries can be estimated.

REFERENCES
[1] Matthew N. Vella and Ahmed Mahdy, "Survey of Wireless Sensor Network Security", Texas A&M University-Corpus Christi, Computer Science Program.
[2] Sudip Misra, Vivek Tiwari and Mohammad S. Obaidat, "LACAS: Learning Automata-Based Congestion Avoidance Scheme for Healthcare Wireless Sensor Networks", IEEE Journal on Selected Areas in Communications, Vol. 27, No. 4, May 2009.
[3] Chien-Wen Chiang, Chih-Chung Lin and Ray-I Chang, "A New Scheme of Key Distribution using Implicit Security in Wireless Sensor Networks", ICACT 2010, Feb. 7-10, 2010.
[4] Jianmin Chen and Jie Wu, "On Cryptography Applications to Secure MANETs and WSNs".
[5] F. Amin, A. H. Jahangir and H. Rasifard, "Analysis of Public-Key Cryptography for Wireless Sensor Networks Security", World Academy of Science, Engineering and Technology, 31 (July): 530-535, 2008.
[6] Nidhi Singhal, "Comparative Analysis of AES and RC4 Algorithms for Better Utilization", IJCTT, July-Aug 2011.
[7] Mansoor Ebrahim, "Symmetric Algorithm Survey: A Comparative Analysis", IQRA University Main Campus, Defense View, Karachi.
[8] Aman Kumar, "Comparative Analysis between DES and RSA", M.Tech student, BRCM Bahal (Bhiwani).
[9] Hyo-Won Kim, Su-Hyun Kim, Sun Kang and Taejoo Chang, "Fast Implementation of a 128-bit AES Block Cipher Algorithm in OCB Mode Using a High Performance DSP".

BOOKS
[10] Sandeep Kumar Namini, A Secure Communication for Wireless Sensor Networks Through Hybrid (AES + ECC) Algorithm, LAMBERT Academic Publishing.

[11] S. Kami Makki, Xiang-Yang Li, Niki Pissinou and Kia Makki, Sensor and Ad-hoc Networks: Theoretical and Algorithmic Aspects.
[12] Edgar H. Callaway, Jr., Wireless Sensor Networks: Architectures and Protocols.
[13] Michael McGrath, Wireless Sensor Networks for Healthcare Applications.
[14] Subhas Chandra and Joe-Air Jiang, Wireless Sensor Networks & Ecological Monitoring.
[15] V. Cagri Gungor and Gerhard P. Hancke, Industrial Wireless Sensor Networks: Applications, Protocols, and Standards.
[16] Yingshu Li, My Thai and Weili Wu, Wireless Sensor Networks and Applications, Signals and Communication Technology.

Investigation on Cyber Crime, Cyber Law and Cyber Security

Prof. T. Ranganayaki, Associate Professor & Ph.D Research Scholar, Dept. of Computer Science, Erode Arts and Science College, Erode-9, Tamil Nadu. E-Mail: ranganayakitcs@gmail.com
Prof. Dr. M. Venkatachalam, Associate Professor & Head, Dept. of Electronics, Erode Arts and Science College, Erode-9, Tamil Nadu. E-Mail: eacmvenket@yahoo.com

ABSTRACT
Cyber crimes may be said to be those crimes whose genus is the conventional crime, where the computer is either a tool or a medium for conducting or committing the crime. Cybercrime is easy to commit, hard to detect and often hard to locate in jurisdictional terms, given the geographical indeterminacy of the net. Cyber criminals can destroy web sites and portals by hacking and planting viruses, carry out online frauds by transferring funds from one corner of the globe to another, and gain access to highly confidential and sensitive information. Cyber law is the law governing cybercrime; in India there is a single act that deals with cyber law, the Information Technology Act 2000. This study presents an overall investigation of cybercrime, cyber law and cyber security.

Keywords
Cyber crime, cyber law, security, sensitive information, computer virus, hacking, indeterminacy.

1. INTRODUCTION
Cyber crimes have significantly increased in India, and the trends in this regard are not very promising [1]. For instance, the cyber law, cyber security and cyber forensics trends of 2013 showed poor performance by the Indian government in these fields, and this position did not change in 2014: the cyber forensics trends of India in 2014 still show an inability to deal with cyber-forensics-related issues. India is also clinging to outdated laws like its cyber law and telegraph law, and is not investing effectively in intelligence agencies and law-enforcement technology. Cyber crimes are increasing in India, and there is no robust cyber law and cybercrime investigation infrastructure [2][3]. Incidents like e-mail cracking, Facebook misuse, intellectual property theft, etc. have significantly increased in India due to the absence of such a framework.

The commonly accepted definition of cyber security is the protection of any computer system, software program and data against unauthorized use, disclosure, transfer, modification or destruction, whether accidental or intentional. Cyber-attacks can come from internal networks, the Internet, or other private or public systems. Businesses cannot afford to be dismissive of this problem, because those who don't respect, address and counter this threat will surely become victims.

IT (Information Technology) systems are vulnerable to a variety of disruptions from a variety of sources, such as natural disasters, human error and hacker attacks. These disruptions can range from mild (e.g., a short-term power outage or a hard disk drive failure) to severe (e.g., equipment destruction, fire, or an online database being hacked). Crisis (and disaster recovery) planning refers to the interim measures needed to recover IT services following an emergency or system disruption [4]. Interim measures may include the relocation of IT systems and operations to an alternate site, the recovery of IT functions using alternate equipment, or the performance of IT functions using manual methods to minimize the business impact.

1.1 Surveillance
Many companies overlook the fact that security monitoring or surveillance is necessary in order to protect their information assets. Security Information Management (SIM) systems, if configured properly, can be useful in collecting and correlating security data (system logs, firewall logs, anti-virus logs, user profiles, physical access logs, etc.) to help identify internal and external threats [5]. A successful surveillance program includes practices such as:
- Security in depth: several layers of security are better than one, and surveillance on each layer helps identify the severity of a security event; alerts coming from the internal corporate network might be more urgent than those on the external network.
- Encrypting critical business data, with strict role-based access controls and logging of all changes for an accurate audit trail; a policy of least-privilege access should always be implemented with respect to sensitive information, and logs should be reviewed regularly for suspicious activity [7].
- Reviewing the identity management process to determine who has access to what information on the corporate network, and ensuring that the access of ex-employees, contractors and vendors is eliminated when they are no longer needed or leave the organization.
- Placing network intrusion detection/prevention systems throughout the corporate network to help detect suspicious or malicious activity [9].

1.2 Security Training and Awareness
The human factor is the weakest link in any information security program. Communicating the importance of information security and promoting safe computing are key to securing a company against cyber crime [6]. Below are a few best practices:
- Use a passphrase that is easy to remember, e.g., E@tUrVegg1e$ ("Eat your veggies"), and make sure to use a combination of upper- and lower-case letters, numbers and symbols to make it less susceptible to brute-force attacks [8]. Try not to use simple dictionary words, as they are subject to dictionary attacks, a type of brute-force attack.
- Do not share or write down any passphrases.
- Communicate with and educate your employees and executives on the latest cyber security threats and what they can do to help protect critical information assets.
- Do not click on links or attachments in e-mail from untrusted sources.

- Do not send sensitive business files to personal e-mail addresses.
- Have suspicious/malicious activity reported to security personnel immediately.
- Secure all mobile devices when traveling, and report lost or stolen items to technical support for remote kill/deactivation.
- Educate employees about phishing attacks and how to report fraudulent activity.

1.3 How Computers Work
Hardware - All of a computer's physical components, including the mouse, keyboard, screen and printer, as well as internal parts like the processor and hard drive.
Operating system - Creates the connection between the computer's hardware and the application software employed by the user. Common operating systems include Microsoft Windows, Mac OS and Linux.
Software - A set of instructions that cause the computer to perform certain tasks; can be divided into two types, system software and application software.
Browsers - Programs that look through content published on the Internet and display Internet pages. The most commonly used browsers are Microsoft Internet Explorer and Mozilla Firefox.

2. VULNERABILITIES
Malware - The name given to malicious software that operates under the guise of a useful software program. It runs computer processes that are unexpected or unauthorized but always harmful [14]. The term malware generally covers viruses, worms and Trojans.
Viruses - Software with the ability to self-replicate and attach itself to other executable programs; the behavior is comparable to its biological counterpart. Computer viruses can be contagious (they might spread on or even beyond the infected computer), exhibit symptoms (the presence and magnitude of malicious code) and involve a recovery period with possible long-term effects (difficulty of removal and loss of data).
Worm - An autonomous program, or constellation of programs, that distributes fully functional copies or parts of itself to other computers. Worms are specialists in spreading and reproducing; they consistently exploit all known vulnerabilities, including people, to penetrate barriers that seem impenetrable to normal viruses. A worm does not have a payload of its own but is often used as a transport mechanism for viruses that ride piggyback and immediately start their work.
Grayware - Applications that cause annoying behavior in the way programs run. Unlike malware, grayware does not fall into the category of major threats and is not detrimental to basic system operations.
Spyware - Software installed under misleading premises that monitors and collects a user's data and eventually transmits it to a company for various purposes. This typically happens in the background; that is, the activity is invisible to most users.
Phishing - A method of stealing personal data whereby an authentic-looking e-mail is made to appear as if it is coming from a real company or institution [15]. The idea is to trick the recipient into sending secret information, such as account information or login data, to the scammer.
Dialers - Dialing programs used to dial up an Internet connection using preset and typically overpriced phone numbers.
Backdoor - An application or service that permits remote access to an infected computer; it opens up a so-called backdoor to circumvent other security mechanisms.
Adware - Software that displays banner ads or pop-ups when a computer is in use. The presence of adware is likely if dubious offers are displayed as pop-ups or banner ads even when you are visiting a reputable website and have a pop-up blocker enabled. Even though adware is not classified as harmful malware, many users regard it as irritating and intrusive, and it can have undesired effects on a system, even interrupting the Internet connection or system operations.
Trojans - From the Greek legend of the Trojan horse. In the world of computers, it refers to covert infiltration by malicious software under the guise of a useful program. After a Trojan is activated, it is often very difficult to discover the extent of the damage and to identify the malware; the Trojan may change its original name and reactivate every time the PC is started.

2.1 Protection
Anti-virus software - Software that detects and removes viruses.
Firewalls - A personal firewall is a program that works on a PC as a protective filter for data communication in a potentially dangerous network such as the Internet.

3. ENCRYPTION TECHNOLOGY
Another factor that can complicate the investigation of cybercrime is encryption technology, which protects information from access by unauthorized people and is a key technical solution in the fight against cybercrime. Encryption is a technique of turning a plain text into an obscured format by using an algorithm. Like anonymity, encryption is not new, but computer technology has transformed the field; for a long time it was subject to secrecy, and in an interconnected environment such secrecy is difficult to maintain.

The widespread availability of easy-to-use software tools and the integration of encryption technology into operating systems now make it possible to encrypt computer data with the click of a mouse, and thereby increase the chance of law-enforcement agencies being confronted with encrypted material. Various software products are available that enable users to protect files against unauthorized access, but it is uncertain to what extent offenders already use encryption technology to mask their activities. One survey on child pornography suggested that only 6 per cent of arrested child-pornography possessors used encryption technology, but experts highlight the threat of increasing use of encryption technology in cybercrime cases. There are different technical strategies to recover encrypted data, and several software tools are available to automate these processes; strategies range from analyzing weaknesses in the software tools used to encrypt files, searching for encryption passphrases and trying typical passwords, to complex and lengthy brute-force attacks.

The term "brute-force attack" describes the process of identifying a code by testing every possible combination. Depending on the encryption technique and key size, this process can take decades. For example, if an offender uses encryption software with a 20-bit key, the size of the key space is around one million; using a current computer processing one million operations per second, the encryption could be broken in less than one second. However, if offenders use a 40-bit key, it could take up to two weeks to break the encryption. A short worked example of this key-space arithmetic is sketched below.
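The key-space arithmetic in the preceding paragraph can be reproduced with a few lines of Python; the assumed rate of one million key trials per second is the figure used in the text, not a measured value:

# Worked example: brute-force search time as a function of key size,
# assuming 10^6 key trials per second (the rate assumed in the text).
RATE = 1_000_000  # trials per second (assumption)

for bits in (20, 40, 56, 128):
    keyspace = 2 ** bits                 # number of possible keys
    seconds = keyspace / RATE            # worst-case search time
    years = seconds / (3600 * 24 * 365)
    print(f"{bits}-bit key: {keyspace:.3e} keys, "
          f"{seconds:.3e} s = {years:.3e} years")

# 20-bit: ~1e6 keys  -> about one second
# 40-bit: ~1.1e12 keys -> about 12.7 days ("up to two weeks")
# 56-bit: ~7.2e16 keys -> about 2,285 years on a single such computer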

computer that was encrypted with 40-bit encryption. With 56-bit encryption, a single computer would take up to 2,285 years to break the encryption [18], and if offenders use 128-bit encryption, a billion computer systems working solely on the encryption could take thousands of billions of years to break it.

The latest version of the popular encryption software PGP permits 1,024-bit encryption, and current encryption software goes far beyond the encryption of single files [21]. The latest version of Microsoft's operating systems, for example, allows the encryption of an entire hard disk, and users can easily install encryption software. Although some computer forensic experts believe that this function does not threaten them, the widespread availability of this technology to any user could result in greater use of encryption. Tools are also available to encrypt communications [19], for example e-mails and phone calls sent using VoIP; using encrypted VoIP technology, offenders can protect voice conversations from interception [10-13]. Techniques can also be combined: using software tools, offenders can encrypt messages and exchange them hidden inside pictures or images, a technology called steganography.

The Central Bureau of Investigation (CBI) has also registered cases pertaining to financial frauds under the provisions of the Information Technology Act 2000 along with other Acts, as follows:

Table 1. Financial fraud cases registered by CBI
Year | No. of cases | Amount involved (Rs. in crore)
2010 | 6 | 6.42
2011 | 10 | 12.43

3.1 Understanding Cybercrime: Phenomena, Challenges and Legal Responsibilities
New hardware devices with network technology are also developing rapidly [16]. The latest home entertainment systems turn TVs into Internet access points, while recent mobile handsets store data and connect to the Internet via wireless networks [17]; USB (universal serial bus) memory devices with more than 1 GB capacity have been integrated into watches, pens and pocket knives. Law-enforcement agencies need to take these developments into account in their work: it is essential to continuously educate officers involved in cybercrime investigations, so that they are up to date with the latest technology and able to identify relevant hardware and any specific devices that need to be seized. Another challenge is the use of wireless access points [20]. The expansion of wireless Internet access in developing countries is an opportunity as well as a challenge for law-enforcement agencies: if offenders use wireless access points that do not require registration, it is more challenging to trace them, as investigations lead only to the access points.

Fig 1: Perception about the state of cyber defense in organizations
Fig 2: Cyber security threats sometimes fall through the cracks of existing security systems
Fig 3: "We have adequate intelligence to know about an attempted attack and its impact"

A pie chart reports the respondents' organizational level within the participating organizations; by design, 59 percent of respondents are at or above supervisory level.

Fig 4: Organizational level that best describes the respondent's current position

Fig 4 reports the industry segments of respondents' organizations [22]. This chart identifies financial services (18 percent) as the largest segment, followed by the public sector (15 percent) and industrial (11 percent).

4. CONCLUSION
The risks of cyber crime are very real and too ominous to be ignored. Every franchisor and licensor, indeed every business owner, has to face up to its vulnerability and do something about it. At the very least, every company must conduct a professional analysis of its cyber security and cyber risk; engage in a prophylactic plan to minimize liability; insure against losses to the greatest extent possible; and implement and promote a well-thought-out cyber policy. As new opportunities and challenges keep surfacing, cyber law, being a constantly evolving process, is suitably modifying itself to fit the call of the time. Though the Information Technology Act 2000 is itself a comprehensive piece of legislation, it has had some inherent shortcomings, and cyber law is likely to see various emerging trends that will have to be appropriately addressed by law makers. According to this investigation, there is an urgent need for round-the-clock analysis, and reports must be generated to overcome future cyberspace-related issues.

5. REFERENCES
[1] P. M. Bakshi, Handbook of Cyber & E-Commerce, Bharat Law House Pvt Ltd.
[2] R. K. Suri and T. N. Chhabra, Cyber Crime, Pentagon Press.
[3] V. D. Dudeja, Crimes in Cyber Space (Scams and Frauds), Commonwealth Publishers.
[4] Rohas Nagpal, Introduction to Indian Cyber Law.
[5] Abha Chauhan, "Evolution and Development of Cyber Law - A Study with Special Reference to India".
[6] Vikas Asawat, "Information Technology (Amendment) Act, 2008: A New Vision through a New Change".
[7] Vijaykumar Shrikrushna Chowbe, "An Introduction to Cybercrime: General Considerations".
[8] Vijaykumar Shrikrushna Chowbe, "The Concept of Cyber Crime: Nature & Scope", International Journal of Humanities and Applied Sciences (IJHAS), Vol. 2, No. 4.
[9] CERT Coordination Center, CERT/CC Statistics 1988-2005 (2005), (14 Dec. 2005).
[10] Peter G. Neumann, "Information System Adversities and Risks", paper presented at the Conference on International Cooperation to Combat Cyber Crime and Terrorism, Stanford, CA: Hoover Institution, 1999.
[11] Carl Howe, John C. McCarthy, Tom Buss and Ashley Davis, "The Forrester Report: Economics of Security", February 1998.
[12] Gordon, Loeb, Lucyshyn and Richardson, "Tenth Annual CSI/FBI Computer Crime and Security Survey".
[13] Stephen J. Lukasik, "Protecting the Global Information Commons", Telecommunication Policy 24, No. 6-7 (2000): 519-531.
[14] Robert L. Ullman and David L. Ferrera, "Crime on the Internet", Boston Bar Journal, No. 6 (November/December 1998).
[15] L. Jean Camp and Catherine Wolfram, "Pricing Security", in Proceedings of the CERT Information Survivability Workshop (Boston, Massachusetts, 24-26 October 2000), 31-39, (14 Dec. 2005).
[16] Copyrights, Trademarks & Literary Prop. Course Handbook.
[17] "Good (Economics and Accounting)", Wikipedia, the free encyclopedia (15 Dec. 2005).
[18] Paul A. Samuelson, "The Pure Theory of Public Expenditure", Review of Economics and Statistics 36 (November 1954): 387-389.
[19] Hal R. Varian, "System Reliability and Free Riding", in Proceedings of the First Workshop on Economics and Information Security (University of California, Berkeley, 16-17 May 2002), (14 Dec. 2005).
[20] Powell, "Is Cyber Security a Public Good? Evidence from the Financial Services Industry".
[21] Neal Kumar Katyal, "The Dark Side of Private Ordering for Cyber Security", in The Law and Economics of Cyber Security, ed. Mark F. Grady and Francesco Parisi (Cambridge University Press, November 2005).
[22] Bruce H. Kobayashi, "An Economic Analysis of the Private and Social Costs of the Provision of Cyber Security and Other Public Security Goods", Supreme Court Economic Review 14 (2005), (15 Dec. 2005).
[23] Ian C. Ballon, "Alternative Corporate Responses to Internet Data Theft", in 17th Annual Institute on Computer Law 737, 744 (PLI).

Enhanced Detection and Prevention of DDoS Attacks using Packet Filtering Technique

M. Sivakumar*, Assistant Professor and Research Scholar, Department of Computer Science, Sri Kandhan College of Arts and Science, Erode, Tamil Nadu, India. mskumar_03@yahoo.com
C. Senthilkumar, Associate Professor, Department of Computer Science, Erode Arts and Science College, Erode, Tamil Nadu, India. csincseasc@gmail.com

ABSTRACT
The usage of the Internet is growing day by day, and so are the vulnerabilities in it. Distributed Denial-of-Service (DDoS), a major concern of advanced networking, is dealt with in detail: this research paper discusses the detection and prevention of DDoS attacks using a packet filtering technique. DDoS attacks are a critical threat to the Internet. This research work introduces a DDoS defense scheme that supports automated online attack characterization and accurate attack-packet discarding based on statistical processing. The key idea is to prioritize a packet based on a score that estimates its legitimacy given the attribute values it carries. Once the score of a packet is computed, the proposed scheme performs score-based selective packet discarding, where the dropping threshold is dynamically adjusted based on the score distribution of recent incoming packets and the current level of system overload. The proposed research work describes the design and evaluation of automated attack characterization, selective packet discarding and an overload control process. Special considerations are made to ensure that the scheme is amenable to high-speed hardware implementation through scorebook generation and pipeline processing. A simulation study indicates that PacketScore is very effective in blocking several different attack types under many different conditions.

INDEX TERMS
Network Security and Protection, Performance Evaluation, Traffic Analysis, Network Monitoring, Simulation.

1. INTRODUCTION
One of the major threats to cyber security is Distributed Denial-of-Service (DDoS) attacks, in which victim networks are bombarded with a high volume of attack packets originating from a large number of machines. The aim of such attacks is to overload the victim with a flood of packets and render it incapable of performing normal services for legitimate users. In a typical three-tier DDoS attack, the attacker first compromises relay hosts called agents, which in turn compromise attack machines called zombies that transmit attack packets to the victim. Packets sent from zombie machines may have spoofed source IP addresses to make tracing difficult [1]. DDoS attacks can be launched by unsophisticated casual attackers using widely available DDoS attack tools such as Trinoo, TFN2K, Stacheldraht, etc.

The DDoS problem has attracted much attention from the research community recently. In our observation, there are three major branches of research in DDoS, namely:
1) attack detection, e.g., by monitoring protocol behavior [2],
2) attack traceback, e.g., by packet marking [3], and
3) attack traffic filtering, e.g., by the PacketScore scheme.

2. REVIEW OF LITERATURE
Distributed denial-of-service (DDoS) attacks pose an immense threat to the Internet, and consequently many defense mechanisms have been proposed to combat them. Attackers constantly modify their tools to bypass these security systems, and researchers in turn modify their approaches to handle new attacks. The DDoS field is evolving quickly, and it is becoming increasingly hard to grasp a global view of the problem. This work strives to introduce some structure to the DDoS field by developing a taxonomy of DDoS attacks and DDoS defense systems. The goal of the research is to highlight the important features of both attack and security mechanisms and to stimulate discussions that might lead to a better understanding of the DDoS problem.

The proposed taxonomies are complete in the following sense: the attack taxonomy covers known attacks as well as those that have not yet appeared but are potential threats to current defense mechanisms; the defense-systems taxonomy covers not only published approaches but also some commercial approaches that are sufficiently documented to be analyzed. Along with the classification, we emphasize important features of each attack or defense-system category and provide representative examples of existing mechanisms.

3. EXISTING SYSTEM
Existing attack detection methods help to minimize the traffic flow of the network by discarding data packets and allowing only a particular number of data packets to flow over the network. The main drawback of the existing system is that legitimate packets get discarded while this process is carried out: it disturbs the Time-To-Live (TTL) of the data packets, and legitimate packets get dropped.

The attack traceback method works on the principle of tracing back the illegitimate data packets that lead to the attack. In reality, however, hackers tend to increase the volume of data packets, making the traceback [3] method in-

efficient. Traceback is possible in the case of tracing a data packet within the boundary of the ISP; in the case of the Internet as such, it is impractical to trace back the illegitimate data packets.

4. PROPOSED METHODOLOGY
A defense scheme based on distributed detection and automated on-line attack characterization has recently been proposed [4]. The scheme consists of the following three phases:
- Detect the onset of an attack and identify the victim by monitoring four key traffic statistics of each protected target while keeping minimal per-target state.
- Differentiate between legitimate and attack packets destined towards the victim based on a readily computed, Bayesian-theoretic metric of each packet, called the Conditional Legitimate Probability (CLP).
- Discard packets selectively by comparing the CLP of each packet with a dynamic threshold. The threshold is adjusted according to (1) the distribution of all suspicious packets and (2) the congestion level of the victim.

One of the key concepts in PacketScore is the notion of Conditional Legitimate Probability (CLP), based on Bayes' theorem. CLP indicates the likelihood of a packet being legitimate by comparing its attribute values with the values in a baseline profile. Packets are selectively discarded by comparing the CLP of each packet with a dynamic threshold. The concept of using a baseline profile with Bayes' theorem has been used previously in anomaly-based IDS (Intrusion Detection System) applications [5], where the goal is generally attack detection rather than real-time packet filtering.

In this research, the basic concept is extended to a practical real-time packet filtering scheme. The PacketScore operations are described for single-point protection, but the fundamental concept can be extended to a distributed implementation for core routers. The concept of Conditional Legitimate Probability (CLP) is described, the profiling of legitimate traffic characteristics is examined, and score assignment to packets, selective discarding, and overload control are described. The performance of the standalone packet filtering scheme is then evaluated, some important issues related to the PacketScore scheme are discussed, and finally the direction of future investigation is outlined.

4.1. Scoring Packets
Packets are scored using the Conditional Legitimate Probability, based on Bayes' theorem. The concept of using a baseline profile with Bayes' theorem has been used previously in anomaly-based IDS applications, where the goal is generally attack detection rather than real-time packet filtering.

4.2 Conditional Legitimate Probability
The most challenging issue in blocking DDoS attacks is to distinguish attack packets from legitimate ones. To resolve this problem, the concept of Conditional Legitimate Probability (CLP) is utilized for identifying attack packets probabilistically. The CLP is produced by comparing the traffic characteristics during the attack with previously measured, legitimate traffic characteristics. The viability of this approach is based on the premise that some traffic characteristics are inherently stable during normal network operations of a target network.

To formalize the concept of CLP, consider all the packets destined for a DDoS attack target. Each packet carries a set of discrete-value attributes A, B, C, .... For example, A might be the protocol type, B might be the packet size, and C might be the TTL value. We define {a1, a2, a3, ...} as the possible values of attribute A, {b1, b2, b3, ...} as the possible values of attribute B, and so on. During an attack, Nn legitimate packets and Na attack packets arrive in T seconds, totaling Nm packets:

Nm = Nn + Na (m for measured, n for normal, and a for attack).

Let Cn(A = ai) denote the number of legitimate packets with value ai for attribute A, and similarly for Ca (attack) and Cm (measured). Then

Nn = Cn(A = a1) + Cn(A = a2) + ... + Cn(A = ai) + ... = Cn(B = b1) + Cn(B = b2) + ... + Cn(B = bi) + ...
Na = Ca(A = a1) + Ca(A = a2) + ... + Ca(A = ai) + ... = Ca(B = b1) + Ca(B = b2) + ... + Ca(B = bi) + ...
Nm = Cm(A = a1) + Cm(A = a2) + ... + Cm(A = ai) + ... = Cm(B = b1) + Cm(B = b2) + ... + Cm(B = bi) + ...

Let Pn be the ratio, i.e., the probability, of attribute values among the legitimate packets, and Pm the corresponding ratio among all measured packets. The Conditional Legitimate Probability (CLP) is defined as the probability of a packet being legitimate given its attribute values:

CLP(packet p) = P(packet p is legitimate | p's attribute A = ap, attribute B = bp, ...).

According to Bayes' theorem, the conditional probability of an event E given an event F is

P(E | F) = P(E ∩ F) / P(F).

Therefore, the CLP can be rewritten as

CLP(p) = [Nn · Pn(A = ap, B = bp, ...) / Nm] / [Nm · Pm(A = ap, B = bp, ...) / Nm].  .... (1)

Assuming the attributes are independent, (1) can be further rewritten as

CLP(p) = [Nn · Pn(A = ap) · Pn(B = bp) · ...] / [Nm · Pm(A = ap) · Pm(B = bp) · ...].  .... (2)

One of the terms in (2), e.g., Pn(A = ap) / Pm(A = ap), is called a partial score.
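To make the computation concrete, the following minimal Python sketch evaluates equation (2) for one packet. The dictionary layout, the attribute names, and the smoothing floor for values never seen in a profile are our illustrative assumptions, not part of the original scheme:

    def clp(packet, Pn, Pm, Nn, Nm, attributes=("proto", "size", "ttl"), floor=1e-6):
        # Equation (2): CLP under the attribute-independence assumption.
        # Pn[attr][value] and Pm[attr][value] hold the nominal and the
        # measured ratio of each attribute value; `floor` guards against
        # zero ratios for unseen values.
        score = Nn / Nm
        for attr in attributes:
            value = packet[attr]
            score *= Pn[attr].get(value, floor) / Pm[attr].get(value, floor)
        return score

A packet whose attribute mix matches the nominal profile gets a score near or above Nn/Nm, while a packet dominated by attack-inflated attribute values is pushed towards zero.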
4.3 Estimating the Legitimate Distribution
Equation (2) shows that the probability of a packet's legitimacy can be calculated by observing the probability of its attribute values in legitimate traffic (Pn) and in total traffic (Pm). However, it is practically impossible to know how many packets are legitimate during the attack period, let alone the number of legitimate packets bearing a particular attribute value. For that reason, an estimate of Pn, called the nominal profile, is used; it is collected in advance during normal operation.

A nominal traffic profile consists of single and joint distributions of various packet attributes that are considered unique for a site. Candidate packet attributes from IP headers are:
- packet size,
- Time-to-Live (TTL) values,
- protocol type values, and
- source IP prefixes.
Those from TCP headers are:
- TCP flag patterns and
- server port numbers, i.e., the smaller of the source port number and the destination port number.

Table 1. An Example of a Nominal Profile

Table 2. Profile Storage Requirements of Different Iceberg Selection Methods

4.4. Profile Structure
Due to the number of attributes in the profile and the large number of possible values of each attribute, especially for the joint attributes, an efficient data structure is required to implement the profile. The attribute values for TTL are 0, 1, 2, ..., 255; thus there are 256 possible attribute values. Each attribute value has a ratio in the profile, as illustrated in Table 1 (e.g., 1.1 percent for TTL value 1). To reduce the storage space, an iceberg-style profile [2] is used, in which only the most frequently occurring attribute values are stored along with their ratios. The iceberg-based profile is similar, in that one-dimensional clusters and multidimensional clusters are comparable to our single attributes and joint attributes, respectively.

4.5 Traffic Profile Stability
PacketScore depends on the stability of the traffic profile for estimating Pn. It is known that, for a given subnet, there is a distinct traffic pattern in terms of packet attribute value distribution for a given time and/or given day [6], [7], [8]. In general, the nominal traffic profile is believed to be a function of time which exhibits periodic time-of-day and day-of-the-week variations as well as long-term trend changes.

Table 3. Trace Data Sites from NLANR

To further verify traffic profile stability, an analysis was conducted with packet trace data available from the NLANR packet trace archives [9]. All trace data were collected for 90 seconds from 17 sites within the US, with link speeds ranging from OC-3 to OC-48. We randomly selected the four sites in Table 3, and a total of 49 trace files were downloaded for analysis.

Table 4. Some Downloaded Traces

A quick examination of Table 4 revealed that each site had a distinct traffic composition. In particular, the traffic in AIX was mostly GRE rather than TCP or UDP. Fig. 1a shows the stability within one trace, and Fig. 1c the stability across multiple days. Fig. 1b compares seven consecutive days, from 26 September 2005 to 2 October 2005, at approximately 9:00 a.m. (morning) and 8:00 p.m. (evening) each day. It indicates that there is moderate correlation among the daily profiles; compared with itself, SL = 1. It also shows that there is a higher correlation for the same time of day (approximately 9:00 a.m.) than for a different time of day (approximately 8:00 p.m.). Fig. 1c compares the profiles of seven consecutive Tuesdays at the same time of day (approximately 9:00 a.m.) starting from 23 August 2005; although it spans seven weeks, it still shows a correlation similar to the short-term profiles of 9:00 a.m. in Fig. 1a. These seven Tuesday morning profiles are slightly closer than the evening profiles in Fig. 1b. However, as Fig. 1d shows, when the profiles are compared with those of other sites, the SL is much lower, indicating a much weaker correlation.

Fig 1: Traffic profile stability comparisons. (a) Consecutive 10-second windows. (b) Seven consecutive days in a week. (c) Seven consecutive Tuesdays. (d) Four different sites for seven consecutive days in a week.
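Returning to the iceberg-style profile of Section 4.4, a minimal Python sketch of profile construction follows. The dict-based packet representation and the 1 percent cut-off are illustrative assumptions:

    from collections import Counter

    def iceberg_profile(packets, attribute, cutoff=0.01):
        # Count how often each value of `attribute` occurs, keep only the
        # values whose ratio reaches the cut-off, and lump the rest into a
        # single default entry (the non-iceberg mass).
        counts = Counter(p[attribute] for p in packets)
        total = sum(counts.values())
        profile = {v: c / total for v, c in counts.items() if c / total >= cutoff}
        profile["__other__"] = 1.0 - sum(profile.values())
        return profile

Only the frequent values (e.g., the dominant TTLs of Table 1) are stored individually; every rare value shares the single default ratio, which keeps the per-attribute table small.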
4.6 Log-Version CLP
To make the scheme more suitable for real-time processing, the floating-point division and multiplication operations are converted into subtraction and addition operations with the help of the logarithmic version of (2), shown below:

log[CLP(p)] = [log(Pn(A = ap)) - log(Pm(A = ap))] + [log(Pn(B = bp)) - log(Pm(B = bp))] + [log(Pn(C = cp)) - log(Pm(C = cp))] + ...

A scorebook is constructed for each attribute that maps the different values of the attribute to a specific partial score. For instance, the partial score of a packet with attribute A equal to ap is given by [log(Pn(A = ap)) - log(Pm(A = ap))]. Since Nn/Nm in (2) is constant for all packets within the same observation period, it can be ignored when comparing and prioritizing packets based on their CLP values.

Scoring a packet is then equivalent to looking up the scorebooks, e.g., the TTL scorebook, the packet size scorebook, the protocol type scorebook, etc. After looking up the multiple scorebooks, the matching entries in the log-version scorebooks are added. This is generally faster than multiplying the matching entries in a regular scorebook. The small speed improvement from converting a multiplication into an addition is particularly useful because every single packet must be scored in real time, and the improvement becomes more beneficial as the number of scorebooks increases. On the other hand, generating a log-version scorebook may take longer than generating a regular scorebook. However, the scorebook is generated only once at the end of each period, and it is not necessary to observe every packet for scorebook generation; thus, some processing delay can be tolerated. Furthermore, scorebook generation can easily be parallelized using two processing lines, which allows complete sampling without missing a packet.

Table 5. Partial Scorebook for One Attribute (e.g., TTL)

Fig 2: Diagram of the Packet Filtering Method
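The scorebook generation and additive scoring of Section 4.6 (cf. Table 5) can be sketched in Python as follows; the flooring of zero ratios and the zero default for values absent from every scorebook are our assumptions:

    import math

    def build_scorebook(Pn_attr, Pm_attr, floor=1e-6):
        # Partial score of each attribute value v: log Pn(v) - log Pm(v),
        # i.e., the log-version of one term of equation (2).
        values = set(Pn_attr) | set(Pm_attr)
        return {v: math.log(Pn_attr.get(v, floor)) - math.log(Pm_attr.get(v, floor))
                for v in values}

    def log_clp(packet, scorebooks):
        # Score = sum of partial scores over all attributes; the constant
        # log(Nn/Nm) term is dropped since it does not change the ranking.
        return sum(book.get(packet[attr], 0.0) for attr, book in scorebooks.items())

A missing entry contributes 0.0 here; an implementation could instead assign the default non-iceberg ratio mentioned in Section 4.4.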
4.7 The Integrated Process
Fig. 2 depicts the integrated operation of CLP computation and the determination of a dynamic discarding threshold for the CLP. A load-shedding algorithm, such as the one described in [10], is used to determine the fraction of arriving suspicious traffic that needs to be discarded in order to keep the utilization of the victim below a target value. Typical inputs to a load-shedding algorithm include the current utilization of the victim, the maximum (target) utilization allowed for the victim, and the current aggregated arrival rate of suspicious traffic. Once the required packet discarding percentage is determined, the corresponding CLP discarding threshold, Thd, is determined from a recent snapshot of the CDF of the CLP values.

Specifically, the following three operations are performed in a pipeline when a packet arrives:
1. Incoming packet profiling:
- Packets are observed to update Pm.
- At the end of the period, Pn/Pm is calculated and the scorebooks are generated.
2. Scoring:
- Packets are scored according to the most recent scorebooks.
- At the end of the period, the CDF is generated and the cut-off threshold is calculated.
3. Discarding:
- Packets are scored according to the most recent scorebooks.
- A packet is discarded if its score is below the cut-off threshold score.
Fig. 2 shows the selective discarding of packets performed by the packet filtering method.

4.8 Selective Packet Discarding
Once the score is computed for a packet, selective packet discarding and overload control can be performed using the score as the differentiating metric. Since an exact prioritization would require offline, multiple-pass operations, e.g., sorting and packet buffering, the following alternative approach is taken. First, the cumulative distribution function (CDF) of the scores of all incoming packets in time period Ti is maintained. Second, the cut-off threshold score is calculated. Third, an arriving packet in time period T(i+1) is discarded if its score is below the cut-off threshold. As shown in Fig. 3, the packets arriving in T(i+1) at the same time create a new CDF.

Fig 3: Selective packet discarding

5. PACKET FILTERING METHOD
PROPOSED ALGORITHM
STEP 1: Start the process.
STEP 2: Trace the incoming packets using jpcap (a Java component used to capture packets).
STEP 3: Assign scores to the packets based on the log Conditional Legitimate Probability.
STEP 4: Fix the threshold value, which depends on parameters like protocol type and packet size.
STEP 5: Compare the threshold with the packet score.
STEP 6: If the packet score is below the threshold value, then discard the packet.
STEP 7: Update the cumulative distribution function of the scores and continue from Step 3.
STEP 8: Stop the process.

6. PERFORMANCE EVALUATION
The profiler program that generated the nominal profile from the packet trace consumed about 1.5 MB of memory. For each 10-second window comprising about 50,000 packets, the program executed for approximately 0.5 seconds on a 1.5 GHz Intel Pentium PC. The PacketScore filtering program read the nominal profile and packet traces, generated the attack packets, created scorebooks, and selected the packets to drop. It consumed about 40 MB of memory, and executed for approximately 0.5 seconds for each 1-second window comprising 5,000 legitimate and 150,000 attack packets. This amount of traffic is roughly equivalent to a speed of 1-2 Gbps. It is believed that the execution time and memory requirements can be greatly improved by optimization and hardware support.

6.1 Performance Metrics
To evaluate PacketScore's performance, the differences in the score distributions of attack and legitimate packets were examined first. These differences are quantified using two metrics, RA and RL. MinL (MaxA) is defined as the lowest (highest) score observed for incoming legitimate (attack) packets, and RA (RL) as the fraction of attack (legitimate) packets with a score below MinL (above MaxA). The closer RA and RL are to 100 percent, the better the score-differentiation power. In practice, score distributions usually have long, thin tails due to very few outlier packets with extreme scores. To avoid the masking effect of such outliers, MinL (MaxA) is taken to be the first (99th) percentile of the score distribution of legitimate (attack) packets. Since the shape of the distribution changes from period to period, the averages of RA and RL are taken over multiple periods. While RA and RL quantify the score-differentiation power, the final outcome of selective discarding also depends on the dynamics of the threshold update. The false positive (i.e., legitimate packets falsely discarded) and false negative (i.e., attack packets falsely admitted) ratios were therefore also measured. To check the effectiveness of the overload control scheme, the actual output utilization ρ_out was compared against the target utilization ρ_target.
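The CDF-based threshold selection and discarding of Sections 4.8 and 5 can be sketched as follows, reusing log_clp from the Section 4.6 sketch; the sorting-based quantile is our simplification of a real streaming CDF:

    def cutoff_threshold(last_period_scores, discard_fraction):
        # Snapshot of the score CDF: the threshold is the score below which
        # `discard_fraction` of last period's packets fall.
        ranked = sorted(last_period_scores)
        k = min(int(discard_fraction * len(ranked)), len(ranked) - 1)
        return ranked[k]

    def filter_packets(packets, scorebooks, threshold):
        # Admit only the packets whose score reaches the cut-off (Step 6).
        return [p for p in packets if log_clp(p, scorebooks) >= threshold]

Each period, the discard fraction produced by the load-shedding algorithm is translated into a fresh threshold, so the filter adapts as the score distribution shifts.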
Fig. 4 shows a typical score distribution trend over 25 periods under a DDoS attack. The attack packets (the high peaks on the upper left side) are concentrated in the lower score region, while the legitimate packets (lower right side) have higher scores. The black bar between the two groups represents the cut-off threshold scores for the discard decision, which removes the majority of the attack traffic. PacketScore tracks the score distribution change in each period and adjusts the cut-off threshold accordingly.

Fig 4: Score distribution over time

6.2 Performance under Different Attack Types
The results are described in Table 6. In most cases, RA and RL were above 99 percent and false positives were kept low. The results were substantially better than random packet dropping, for which the false positive ratio is expected to be 90.7 percent, and better than some previous methods. Furthermore, ρ_out was successfully kept close to its target value in most cases. The false negative ratios were mainly due to the gap between ρ_target and ρ_legitimate. Although the signature of TCP-SYN flood packets can easily be derived by any filtering scheme, the ability of PacketScore to prioritize legitimate TCP-SYN packets over attack packets based on other packet attributes is an essential feature. Without such prioritization, PacketScore did show some degradation under nominal attack when the TTL values were randomized. Legitimate packets having the same characteristics as attack packets were penalized, but the chances of this were still quite small, and the false positive ratio was kept to 3.30 percent. When the TTL values were fixed, the performance was better. The TTL value 118 accounted for the largest portion of traffic (29.86 percent), and the value 100 had a ratio under the adaptive threshold (0.29 percent) for 99 percent coverage. As PacketScore utilizes more attributes, the performance should improve further. Only one joint attribute was used in this experiment; numerous combinations of attributes are possible to increase the number of attributes, and the performance under nominal attack can be further improved. It is also interesting to see how the scores of each attack type are distributed.

Table 6. Performance against Different Types of Attack

However, PacketScore is designed to accept specific traffic only up to the maximum ratio that was observed in the past. Therefore, DDoS traffic beyond this ratio will be properly filtered by PacketScore and thus cannot succeed as a massive attack. Nevertheless, accepting any DDoS traffic is undesirable, as it wastes bandwidth. This situation can be improved by constructing a cleaner nominal profile that contains less DDoS traffic. A cleaner profile can be made in one of two ways.

First, the packet trace data can be analyzed to identify legitimate flows that show proper two-way communication behavior, and the packets from the legitimate flows are used for constructing the profile. Although some traffic flows that do not have continuous packet exchange, such as ICMP, may be left out, PacketScore filtering is already based on an iceberg-style histogram with default value assignment for non-iceberg items; the impact on performance of missing some of the packets should be minimal.

Second, a generic profile can be used first to remove those packets that are more likely to be attack packets, and the remaining packets are then used to create the final nominal profile. The generic profile reflects overall Internet traffic characteristics, e.g., the TCP versus UDP ratio, common packet sizes, common TCP flags, etc. Our preliminary research shows that this two-step profiling is very effective against generic attacks. Further study is needed on these methods.

7. CONCLUSION & FUTURE SCOPE
The PacketScore scheme is used to defend against DDoS attacks. The key concept in PacketScore is the Conditional Legitimate Probability (CLP), produced by comparing legitimate traffic and attack traffic characteristics, which indicates the likelihood of a packet being legitimate. As a result, packets following the legitimate traffic profile have higher scores, while attack packets have lower scores. This scheme can tackle never-before-seen DDoS attack types by providing a statistics-based adaptive differentiation between attack and legitimate packets to drive selective packet discarding and overload control at high speed.

PacketScore is capable of blocking all kinds of attacks as long as the attackers do not precisely mimic the site's traffic characteristics. The performance and design tradeoffs of the proposed packet scoring scheme are studied in the context of a stand-alone implementation. By exploiting the measurement/scorebook generation process, an attacker may try to mislead PacketScore by changing the attack types and/or intensities. Such an attempt can easily be overcome by using a smaller measurement period to track the attack traffic pattern more closely.
The generalized implementation of PacketScore for core networks is currently being investigated. PacketScore is suitable for operation in the core network at high speed, and we are working on an enhanced scheme for core network operation in a distributed manner. In particular, we plan to investigate the effects of update and feedback delays in a distributed implementation, and to implement the scheme in hardware using network processors. Second, PacketScore is designed to work best against large-volume attacks, and it does not work well with low-volume attacks; PacketScore's performance in the presence of such attack types, e.g., bandwidth-soaking attacks [11] or low-rate attacks [12], can be explored and improved. Finally, a thorough investigation of the stability of traffic characteristics shall be performed, ensuring the objective of the proposed research work.

8. REFERENCES
[1]. D. Moore, G.M. Voelker, and S. Savage, "Inferring Internet Denial-of-Service Activity," Proc. 10th USENIX Security Symp., Aug. 2001.
[2]. H. Wang, D. Zhang, and K.G. Shin, "Change-Point Monitoring for the Detection of DoS Attacks," IEEE Trans. Dependable and Secure Computing, vol. 1, no. 4, Oct.-Dec. 2004.
[3]. S. Savage, D. Wetherall, A. Karlin, and T. Anderson, "Network Support for IP Traceback," IEEE/ACM Trans. Networking, vol. 9, no. 3, June 2001.
[4]. W.C. Lau, M.C. Chuah, H.J. Chao, and Y. Kim, "PacketScore: A Proactive Defense Scheme against Distributed Denial of Service Attacks," NSF proposal under submission.
[5]. M. Mahoney and P.K. Chan, "Learning Nonstationary Models of Normal Network Traffic for Detecting Novel Attacks," Proc. ACM SIGKDD 2002, pp. 376-385, 2002.
[6]. J. Jung, B. Krishnamurthy, and M. Rabinovich, "Flash Crowds and Denial of Service Attacks: Characterization and Implications for CDNs and Web Sites," Proc. Int'l World Wide Web Conf., May 2002.
[7]. D. Liu and F. Huebner, "Application Profiling of IP Traffic," Proc. 27th Ann. IEEE Conf. Local Computer Networks (LCN), 2002.
[8]. D. Marchette, "A Statistical Method for Profiling Network Traffic," Proc. First USENIX Workshop Intrusion Detection and Network Monitoring, Apr. 1999.
[9]. NLANR PMA Packet Trace Data, http://pma.nlanr.net/Traces, 2006.
[10]. S. Kasera et al., "Fast and Robust Signaling Overload Control," Proc. Int'l Conf. Network Protocols, Nov. 2001.
[11]. Y. Xu and R. Guerin, "On the Robustness of Router-Based Denial-of-Service (DoS) Defense Systems," ACM SIGCOMM Computer Comm. Rev., vol. 35, no. 3, July 2005.
[12]. Y. Kim, W.C. Lau, M.C. Chuah, and H.J. Chao, "PacketScore: Statistics-Based Overload Control against Distributed Denial-of-Service Attacks," Proc. IEEE INFOCOM, Mar. 2004.
Scalable, Power Aware and Efficient Cluster for Secured MANET

M. Anandhi, Assistant Professor, Department of Information Technology, Cauvery College for Women, Trichy. anandhia@yahoo.com
Prof. Dr. T.N. Ravi, Assistant Professor, Department of Computer Science, Periyar E.V.R College (Autonomous), Trichy. proftnravi@gmail.com
ABSTRACT
A Mobile Ad-Hoc Network is a collection of wireless mobile nodes dynamically forming a network without any existing infrastructure, in which relative position dictates the communication links. Ad-hoc networks are wireless networks where nodes communicate with each other using multi-hop links. Due to nodal mobility, the network topology may change rapidly and unpredictably over time. Cluster-based secure structures have evolved as an important research topic in MANET. This paper discusses the various clustering methods and proposes a secured cluster to improve the performance of large MANETs.

Keywords
MANET, Cluster, Scalable, Attacks, Mobility, Power aware.

1 INTRODUCTION
1.1 MANET
The network is an autonomous transitory association of mobile nodes that communicate with each other over wireless links. Nodes that lie within each other's send range can communicate directly and are responsible for dynamically discovering each other. To enable communication between nodes that are not directly within each other's send range, intermediate nodes act as routers that relay packets generated by other nodes to their destination. These nodes are often energy constrained (that is, battery-powered) devices with a great diversity in their capabilities. Furthermore, devices are free to join or leave the network and may move randomly, possibly resulting in rapid and unpredictable topology changes. In this energy-constrained, dynamic, distributed multi-hop environment, nodes need to organize themselves dynamically in order to provide the necessary network functionality in the absence of fixed infrastructure or central administration.

1.2 Vulnerabilities of MANET
In a MANET there is no clear line of defense because of the freedom of the nodes to join, leave and move inside the network. Some of the nodes may be compromised by an adversary and thus perform malicious behaviors that are hard to detect; the lack of centralized machinery may cause problems when a centralized coordinator is needed; the restricted power supply can cause selfish behavior; and the continuously changing scale of the network sets higher requirements on the scalability of the protocols and services in the mobile ad hoc network. As a result, compared with a wired network, the mobile ad hoc network needs a more robust security scheme to ensure its security.

1.3 Types of attacks
The nature and structure of networks in MANET make them attractive to various types of attackers. Different types of attackers attempt different approaches to decrease the network performance and throughput. On the basis of the nature of the attack, the attacks against MANET may be classified into Legitimate-Based and Interaction-Based attacks.

Legitimate-based attacks are divided into: i) External attacks, committed by nodes that are not legal members of the network; ii) Internal attacks, which come from a compromised member inside the network and are not easy to prevent or detect. Interaction-based attacks are again classified into:

i) Passive Attack:
- Passive attacks are attacks that do not disrupt the proper operation of the network.
- Attackers snoop on data exchanged in the network without altering it.
- The requirement of confidentiality can be violated if an attacker is also able to interpret the data gathered through snooping.
- Detection of these attacks is difficult, since the operation of the network itself is not affected.

ii) Active Attack:
- Active attacks are severe attacks performed by attackers to replicate, modify and delete exchanged data.
- They try to change the behavior of the protocols.
- These attacks are meant to degrade or prevent the message flow among nodes.

1.4 Security Goals
1.4.1 Availability
The term Availability means that a node should maintain its ability to provide all the designed services. Availability applies both to data and to services. It ensures the survivability of network services despite denial of service attacks.

1.4.2 Confidentiality
Confidentiality means that certain information is only accessible to those who have been authorized to access it. In order to maintain the confidentiality of some confidential information,
we need to keep it secret from all entities that do not have the privilege to access it. Confidentiality is sometimes called secrecy or privacy.

1.4.3 Integrity
Integrity means that assets can be modified only by authorized parties or only in authorized ways. Integrity assures that a message being transferred is never corrupted.

1.4.4 Authentication
Authentication enables a node to ensure the identity of the peer node it is communicating with. It is essentially the assurance that participants in communication are authentic and not impersonators. Authenticity is ensured because only the legitimate sender can produce a message that will decrypt properly with the shared key.

1.4.5 Nonrepudiation
Nonrepudiation ensures that the sender and receiver of a message cannot disavow that they have ever sent or received such a message. This is helpful when we need to determine whether a node with some undesired function is compromised or not.

1.4.6 Anonymity
Anonymity means that all information that can be used to identify the owner or current user of a node should by default be kept private and not be distributed by the node itself or the system software.

1.5 Cluster
The concept of a cluster involves taking two or more computers and organizing them to work together to provide higher availability, reliability and scalability than can be obtained by using a single system.

The mobile nodes in a MANET are divided into different virtual groups, and geographically adjacent nodes are allocated to the same cluster according to rules that assign different behaviors to nodes included in a cluster and to those excluded from it.

1.6 Advantages of clustering
First, a cluster structure facilitates the spatial reuse of resources to increase the system capacity. A cluster can better coordinate its transmission events with the help of a special mobile node, such as a cluster head, residing in it. This can save many resources otherwise used for retransmissions resulting from transmission collisions. The second benefit is in routing, because the set of cluster heads and cluster gateways can normally form a virtual backbone for inter-cluster routing, and thus the generation and spreading of routing information can be restricted to this set of nodes. Last, a cluster structure makes an ad hoc network appear smaller and more stable in the view of each mobile terminal. When a mobile node changes its attaching cluster, only the mobile nodes residing in the corresponding clusters need to update their information.

This paper is organized as follows: Section 2 reviews the literature, Section 3 discusses the proposed method, Section 4 concludes the paper, and Section 5 lists the references.

2. REVIEW OF LITERATURE
Mobility-aware clustering [1] indicates that the cluster architecture is determined by the mobility behavior of the mobile nodes. The idea is that, by grouping mobile terminals with similar speed into the same cluster, the intra-cluster links can become more tightly connected, and the re-affiliation and re-clustering rates can be naturally decreased.

In energy-efficient clustering [2], battery power is considered very important for the mobile nodes in a MANET during operation. Energy limitation is a severe challenge for network performance. Compared to ordinary nodes, a cluster head consumes more power because of its extra work. Any node failure due to energy depletion may cause communication interruption, so it is important to balance the energy consumption among mobile nodes to avoid node failure, especially when some mobile nodes bear special tasks or the network density is comparatively sparse.

The load-balancing clustering algorithm [3] is used to specify the optimum number of mobile nodes that a cluster can handle, especially in a cluster-head-based MANET. The system throughput will be reduced if a too-large cluster puts too heavy a load on the cluster head. A too-small cluster, however, may produce a large number of clusters and thus increase the length of hierarchical routes, resulting in longer end-to-end delay. The load-balancing clustering algorithm limits the number of mobile nodes that a cluster can effectively handle. When a cluster's size exceeds its limit, re-formation of the clustering procedure is invoked to adjust the number of mobile nodes in that cluster.

In Lowest-ID clustering [4], each node is assigned a unique ID and broadcasts the ID to all its neighbor nodes. The node with the lowest ID is selected as cluster head. A node which can communicate with more than one cluster is known as a gateway.

The Mobility-based D-hop Clustering Algorithm (MobDHop) [5] dynamically forms variable-diameter clusters that are adaptive to node mobility patterns. It allows the cluster members to be at most d hops away from their cluster heads, to control the cluster size and manage cluster head density. The cluster head election depends on three mobility metrics, namely: (a) Variation of estimated Distance between nodes over time (VD), which estimates the relative mobility of two nodes; (b) Local Variability (LV), which calculates the mean of the VD of all the neighbors of a node; and (c) Group Variability (GV), which calculates the mean of the LV of its 1-hop neighbors. A node that has the lowest variability value (LV) assumes the role of cluster head.

In the trust-based clustering [6] scheme, each node evaluates the trust value of its neighbor nodes and recommends one of the neighbors with the highest trust value as its trust guarantor. The recommender node then becomes a member of the CH node which is one hop away, and the recommender has to recommend other nodes again. When nodes recommend a CH, they give it recommendation certificates called R-Certificates. These certificates are used to authenticate the CH, so a CH which holds many recommendation certificates is considered a more trustworthy node in the ad hoc network. In the
recommendation certificate, the period is the term of validity of the certification.

The Weighted Cluster Algorithm (WCA) proposed for MANET [7] elects the CH based on factors like mobility, ability to handle nodes, communication range, and so forth. The algorithm calculates the average weight of each node based on the provided factors. The node with the minimum weight is selected as cluster head.

The goal of the Highest-Degree algorithm [8] is to minimize the number of clusters, which is achieved as follows. Each node is aware of the number of its neighbor nodes, acquired by interactively exchanging control messages. The node having the highest number of neighbors, i.e. the highest degree, is elected as the cluster head.

The Secured Clustering Algorithm for Mobile Ad Hoc Networks [9] is a secured weight-based clustering algorithm allowing more effectiveness, protection and trust in the management of cluster size variation. It includes security requirements by using a trust value defining how much a node is trusted by its neighborhood, and by using a certificate as the node's identifier to avoid possible attacks.

The Signal and Energy Efficient Clustering (SEEC) algorithm [10] prevents cluster head failure by re-selecting the cluster head when its energy and signal fall to the threshold value; cluster head formation is done when the energy and signal reach the threshold value. The node which has high energy and signal levels is elected as the cluster head.

The above algorithms are analyzed and compared in Table 1. From this study, several factors that they do not consider were identified, and a new method which can handle all those factors is proposed below.

3. PROPOSED METHOD
The proposed method suggests an algorithm to create a clustered architecture which concentrates on security. It is used to identify malicious nodes and avoid attacks.

3.1 Algorithm

3.1.1 Cluster formation phase
Step 1: Generate an ID for every newly entered node using a random number generator (any node can enter or exit, so the cluster is scalable).
Step 2: The weight of each node is calculated based on battery power, stability, memory capacity, energy and degree.
Step 3: Set the maximum and minimum load for the cluster.
Step 4: If the number of nodes exceeds the limits, then create a new cluster.
The ID of the node is used to maintain authentication.

3.1.2 Cluster Head (CH) selection phase
Step 1: Set the node with the highest weight as cluster head.

3.1.3 Re-election phase
Step 1: Check the battery power of the CH against a threshold at regular time intervals.
Step 2: If it is less, the CH sends a battery-low signal to its neighbor nodes.
Step 3: Re-calculate the global weight and select the next CH.

3.1.4 Mobility calculation phase
Step 1: Calculate each node's hop count from the CH between a particular time interval t and t-1.

3.1.5 Reputation Index phase (Integrity)
Step 1: Initialize the Reputation Index (RI) value for each node.
Step 2: Increment the RI value of every upstream or relay node if an acknowledgement receipt of packets is received.
Step 3: Decrement the RI value if no acknowledgment is received from the downstream node or if it floods the path.
Step 4: A node with a negative RI value is excluded from the next route discovery process.
Step 5: Consider a node with a negative RI value a selfish node or attacker.

A minimal sketch of the weight-based phases is given below.
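The following Python sketch illustrates the weight-based election and re-election phases. The weight coefficients and node fields are hypothetical, since the paper does not fix a concrete weight formula:

    def global_weight(node, w=(0.3, 0.2, 0.2, 0.2, 0.1)):
        # Hypothetical weighted sum of the factors named in Step 2 of the
        # cluster formation phase: battery, stability, memory, energy, degree.
        factors = (node["battery"], node["stability"], node["memory"],
                   node["energy"], node["degree"])
        return sum(wi * fi for wi, fi in zip(w, factors))

    def elect_cluster_head(nodes):
        # Phase 3.1.2: the node with the highest global weight becomes CH.
        return max(nodes, key=global_weight)

    def reelect_if_low_battery(ch, nodes, threshold=0.2):
        # Phase 3.1.3: on a battery-low signal, recompute the weights and
        # pick the next CH from the remaining nodes.
        if ch["battery"] < threshold:
            return elect_cluster_head([n for n in nodes if n is not ch])
        return ch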
4. CONCLUSION
Security has become a primary concern due to MANET characteristics such as being dynamic and infrastructure-less, having no centralized authority, mobility and scalability. This paper discussed the issues in MANET, compared various clustering schemes and proposed a secured, scalable, energy-efficient and mobility-based clustering algorithm for MANET.

5. REFERENCES
[1]. P. Basu, N. Khan, and T. D. C. Little, "A Mobility Based Metric for Clustering in Mobile Ad Hoc Networks," in Proc. IEEE ICDCSW'01, Apr. 2001, pp. 413-418.
[2]. J.-H. Ryu, S. Song, and D.-H. Cho, "New Clustering Schemes for Energy Conservation in Two-Tiered Mobile Ad-Hoc Networks," in Proc. IEEE ICC'01, vol. 3, June 2001, pp. 862-866.
[3]. A. D. Amis and R. Prakash, "Load-Balancing Clusters in Wireless Ad Hoc Networks," in Proc. 3rd IEEE ASSET'00, Mar. 2000, pp. 25-32.
[4]. C. R. Lin and M. Gerla, "Adaptive clustering for mobile wireless networks," IEEE Journal on Selected Areas in Communications, pp. 1265-1275, 1997.
[5]. I. Er and W. Seah, "Mobility-based d-hop clustering algorithm for mobile ad hoc networks," in Proc. IEEE Wireless Communications and Networking Conference (WCNC 2004), vol. 4, 2004, pp. 2359-2364.
[6]. Pushpita Chatterjee, Indirani Sengupta, and S.K. Ghosh, "A Trust Based Clustering Framework for Securing Ad Hoc Networks," Information Systems, Technology and Management, Communications in Computer and Information Science, vol. 31.
[7]. M. Chatterjee, S. Das, and D. Turgut, "WCA: A Weighted Clustering Algorithm for mobile ad hoc networks," Cluster Computing, pp. 193-204, Kluwer Academic, 2002.
[8]. M. Gerla and J. T. Tsai, "Multicluster, Mobile, Multimedia Radio Network," Wireless Networks, vol. 1, pp. 255-265, Oct. 1995.
[9]. Yao Yu and Lincong Zhang, "A Secure Clustering Algorithm in Mobile Ad Hoc Networks," IPCSIT, vol. 29, 2012.
[10]. Alak Roy, Manasi Hazarika, and Mrinal Kanti Debbarma, "Energy Efficient Cluster Based Routing in MANET," in Proc. 2012 International Conference on Communication, Information & Computing Technology (ICCICT), Oct. 19-20, 2012.
[11]. M. Anandhi and T.N. Ravi, "An Appraisal of Attacks in MANET and Fortification Methods," IJARCS, vol. 5, no. 8, Nov-Dec 2014, ISSN 0976-5697.

Table 1. Comparison of various clustering algorithms
(S: Scalability, P: Power aware, M: Mobility, L: Low Maintenance)

1. Lowest ID. CH selection: head node with lowest ID. Merits: simple to implement. Demerits: generates more CHs; power is reduced for nodes that serve as CH. S/P/M/L: x X x x.
2. Highest degree. CH selection: node with more neighbors. Merits: fewer hops needed to satisfy any request. Demerits: topology changes occur due to mobility. S/P/M/L: x X x x.
3. Weighted cluster algorithm. CH selection: node with minimum weight. Merits: maximizes the stability of the cluster topology. Demerits: complexity in cluster maintenance. S/P/M/L: x.
4. Trust-based clustering. CH selection: node with the highest number of recommendation certificates. Merits: authenticates the valid node. Demerits: mobility and power/energy are not considered. S/P/M/L: X x x.
5. Signal and energy efficient. CH selection: node with higher energy and signal level. Merits: pays attention to CH election; avoids re-election. Demerits: complexity in maintenance. S/P/M/L: x x x.
6. Load balancing. CH selection: node with optimal load. Merits: each cluster handles a limited number of nodes. Demerits: scalability not supported; high maintenance. S/P/M/L: X x x.
7. Mobility-based d-hop clustering. CH selection: most stable node. Merits: focuses on making the cluster diameter more flexible; considers cluster maintenance during its third phase. Demerits: assumes each node can measure its received signal strength depending upon the closeness between two nodes. S/P/M/L: x X.
Analysis of Limitation in Cryptography Techniques and Application

Dr. Antony Selvadoss Thanamani, Dept. of CS, NGM College, Bharathiar University, Pollachi, India. selvdoss@gmail.com
K. Priyanka, M.Phil Scholar, Folio: B1/373/2014, Dept. of CS, NGM College, Bharathiar University, Pollachi, India. priyanka.skasc08ss@gmail.com
ABSTRACT
Network security is the most vital component in information security because it is responsible for securing all information passed through networked computers. Cryptography is an emerging technology which is important for network security. Research on cryptography is still in its developing stages, and a considerable research effort is still required for secured communication. It has emerged as a secure means for transmission of information, and it mainly helps in curbing intrusion by third parties. It provides data confidentiality, integrity, electronic signatures, and advanced user authentication. The methods of cryptography use mathematics for securing the data. In this research, the limitations of different types of cryptographic techniques and their applications are analyzed.

Keywords
Cryptography, Data Confidentiality, Electronic signatures, Authentication.

1. INTRODUCTION
Information security plays a pivotal role in internet communication in today's era of technology. It is tremendously important for people committing e-transactions. There are various cryptography methods that provide a means for secure commerce and payment, private communications and protecting passwords. Cryptography is the practice and study of techniques for secure communication in the presence of adversaries. Modern cryptography intersects the disciplines of mathematics, computer science, and electrical engineering.

1.1 CRYPTOGRAPHY
Cryptography is a method of storing and transmitting data in a particular form so that only those for whom it is intended can read and process it.

1.1.1 Plain Text
The original message that a person wishes to communicate to another is defined as Plain Text. In cryptography, the actual message that has to be sent to the other end is given the special name Plain Text.

1.1.2 Cipher Text
The message that cannot be understood by anyone, i.e., the meaningless message, is what we call Cipher Text. In cryptography, the original message is transformed into a non-readable message before the transmission of the actual message.

1.1.3 Encryption
The process of converting Plain Text into Cipher Text is called Encryption. Cryptography uses the encryption technique to send confidential messages through an insecure channel. The process of encryption requires two things: an encryption algorithm and a key. An encryption algorithm means the technique that has been used in encryption. Encryption takes place at the sender side.

1.1.4 Decryption
The reverse process of encryption is called Decryption. It is the process of converting Cipher Text into Plain Text. Cryptography uses the decryption technique at the receiver side to obtain the original message from the non-readable message or Cipher Text. The process of decryption requires two things: a decryption algorithm and a key. A decryption algorithm means the technique that has been used in decryption. Generally, the encryption and decryption algorithms are the same.

1.1.5 Secret key
The secret key is the main requirement of symmetric cryptography; it is used for both encryption and decryption. The security of the secret key is most important, since an intruder who knows the key may recover the plain text.

Fig 1: Encryption and Decryption

2. PURPOSE OF CRYPTOGRAPHY
2.1 Confidentiality
When we talk about confidentiality of information, we are talking about protecting the information from disclosure to unauthorized parties. Information has value, especially in today's world. A key component of protecting information confidentiality is encryption. Encryption ensures that only the right people can read the information. Information in a computer is transmitted and has to be accessed only by the authorized party and not by anyone else.

2.2 Integrity
Only the authorized party is allowed to modify the transmitted information. No one in between the sender and receiver is allowed to alter the given message. Integrity of information refers to protecting information from being modified by unauthorized parties. Information only has value if it is correct; information that has been tampered with could prove costly.
2.3 Non-repudiation
Non-repudiation ensures that neither the sender nor the receiver of a message is able to deny the transmission. Non-repudiation is the assurance that someone cannot deny something. Typically, non-repudiation refers to the ability to ensure that a party to a contract or a communication cannot deny the authenticity of their signature on a document or the sending of a message that they originated.

2.4. Authentication
Authentication is the process of proving one's identity. Any system receiving information has to check the identity of the sender, i.e., whether the information is arriving from an authorized person or a false identity.

2.5. Access Control
Only the authorized parties are able to access the given information.

3. TYPES OF CRYPTOGRAPHY
There are different types of cryptography. In each there is a sender, a receiver, an intruder after the information, and a cryptographic tool that prevents the intruder from trespassing on the sensitive information. The main types are:
- Symmetric Key Cryptography or Secret Key Cryptography
- Asymmetric Key Cryptography or Public Key Cryptography
- Hash Functions
- Key Escrow Cryptography
- Translucent Cryptography

3.1 Symmetric Key Cryptography
In symmetric cryptography, the key used for encryption is the same as the key used in decryption. The sender and recipient of the data must share the same key and keep the information secret, preventing data access from outside. Thus the key distribution has to be made prior to the transmission of information. The key plays a very important role in symmetric cryptography, since security directly depends on the nature of the key, i.e., the key length, etc. The bigger difficulty with this approach is the distribution of the key.

Secret key cryptography schemes are generally categorized as either stream ciphers or block ciphers. Stream ciphers operate on a single bit at a time and implement some form of feedback mechanism so that the key is constantly changing. A block cipher is so called because the scheme encrypts one block of data at a time using the same key on each block. In general, the same plaintext block will always encrypt to the same ciphertext when using the same key in a block cipher, whereas the same plaintext will encrypt to different ciphertext in a stream cipher. There are various symmetric key algorithms such as DES, TRIPLE DES, AES, RC4, RC6, and BLOWFISH.

Fig 2: Symmetric Key Cryptography

A symmetric cryptosystem is faster. In symmetric cryptosystems, encrypted data can be transferred on the link even if there is a possibility that the data will be intercepted. Since no key is transmitted with the data, the chances of the data being decrypted are null. A symmetric cryptosystem uses password authentication to prove the receiver's identity. Only a system which possesses the secret key can decrypt a message.

3.1.1 Limitations of Symmetric Key Cryptography
- Symmetric cryptosystems have a problem of key transportation. The secret key must be transmitted to the receiving system before the actual message is transmitted. Every means of electronic communication is insecure, as it is impossible to guarantee that no one will be able to tap communication channels. So the only secure way of exchanging keys would be to exchange them personally.
- They cannot provide digital signatures that cannot be repudiated.
- A private key encryption system requires that anyone new gain access to the key. This access may require transmitting the key over an insecure method of communication.

3.2 Public Key Cryptography
It involves a pair of keys: one for encryption and another for decryption. The key used for encryption is the public key, which is distributed; the key used for decryption, on the other hand, is the private key. Unlike secret key cryptography, keys are not shared. Instead, each individual has two keys: a private key that need not be revealed to anyone, and a public key that is preferably known to the entire world. We will use the letter e to refer to the public key, since the public key is used when encrypting a message, and the letter d to refer to the private key, because the private key is used to decrypt a message. Encryption and decryption are two mathematical functions that are inverses of each other.

Fig 3: Public Key Cryptography
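A toy numeric Python sketch of this inverse relationship is shown below. The textbook-sized primes are illustrative only and far too small for real security; the modular inverse via pow(e, -1, phi) assumes Python 3.8 or later:

    # Toy RSA with tiny primes; for illustration only, never for real use.
    p, q = 61, 53
    n = p * q                  # public modulus
    phi = (p - 1) * (q - 1)    # Euler totient of n
    e = 17                     # public exponent (the "e" above)
    d = pow(e, -1, phi)        # private exponent (the "d" above); e*d = 1 mod phi

    m = 65                     # a message encoded as a number smaller than n
    c = pow(m, e, n)           # encrypt with the public key (e, n)
    assert pow(c, d, n) == m   # decrypting with the private key (d, n) inverts it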
In asymmetric or public key cryptography there is no need for exchanging secret keys, which eliminates the key distribution problem. The primary advantage of public-key cryptography is increased security: the private keys never need to be transmitted or revealed to anyone. It can also provide digital signatures that cannot be repudiated.

3.2.1 Limitations of Public Key Cryptography
- A disadvantage of using public-key cryptography for encryption is speed: there are popular secret-key encryption methods which are significantly faster than any currently available public-key encryption method.

3.3 Hash Functions
Hash algorithms are also known as message digests or one-way transformations. A cryptographic hash function is a mathematical transformation that takes a message of arbitrary length and computes from it a fixed-length number. An advantage of hash functions is the speed of hash tables over other table data structures; the advantage is more apparent when the number of entries is large. Hash tables are particularly efficient when the maximum number of entries can be predicted in advance. If the set of key-value pairs is fixed and known ahead of time, one may reduce the average lookup cost by a careful choice of the hash function, the bucket table size, and the internal data structures.

3.3.1 Limitations of Hash Functions
- Hash tables can be more difficult to implement than self-balancing binary search trees, and in open-addressed hash tables it is fairly easy to create a poor hash function.
- Although a hash table lookup takes constant time on average, the cost of a good hash function can be significantly higher than the inner loop of the lookup algorithm for a sequential list or search tree. Thus hash tables are not effective when the number of entries is very small.
- There is an inability to use duplicate keys when the key value is small.

3.4 Key Escrow Cryptography
This technology allows the use of strong encryption, but also allows obtaining the decryption keys held by escrow agents (third-party entrusted key escrow). The decryption keys are split into parts and given to separate escrow authorities; access to one part of the key does not help decrypt the data, as both parts must be obtained. There are two secret keys, and decryption can be done with either one. One of the keys is used by the two parties; the other is stored, or escrowed, in some secure location, to be used if the original key is forgotten or, more likely, by law enforcement if they suspect the two parties are engaged in criminal activity. It is important to note that when key escrow encryption is used and law enforcement wants to use the escrowed key, they have to obtain a warrant. This is the kind of encryption that many people believe should be all that is available to the general public.

3.4.1 Limitations of Key Escrow Cryptography
- The user has to trust the escrow agent's security procedures, as well as the integrity of the people involved. He has to trust the escrow agents not to change their policies, the government not to change its laws, and those with lawful authority to get his key to do so lawfully and responsibly.

3.5 Translucent Cryptography
In this scheme the government can decrypt some of the messages, but not all: only a fraction p of the messages can be decrypted, while the remaining fraction 1-p cannot. This is advantageous over key escrow or no-key-escrow cryptography, as the entire information is not at security risk.

3.5.1 Limitations of Translucent Cryptography
- Law enforcement may be frustrated that, when it has an authorized wiretap, it is not getting decryption of all of the messages.
- Individuals may be frustrated that this scheme does not provide absolute privacy for their messages; law enforcement can read some fraction of their messages.

4. APPLICATIONS OF CRYPTOGRAPHY
Cryptographic algorithms are widely used to solve problems of data confidentiality, data integrity, data secrecy, authentication and various other domains, applying the cryptographic algorithms mentioned above as the requirements of the task demand.

4.1 Secure Message Transmission Using Proxy-Signcryption
Proxy signature schemes allow proxy signers to sign messages on behalf of an original signer, a company or an organization, and are based on the discrete logarithm problem. Signcryption is a public-key primitive that simultaneously performs the functions of both digital signature and encryption. Integration of the proxy signature and signcryption public key paradigms provides secure transmission that is efficient in terms of computation and communication costs. It is used for low-power computers in which a given device may transmit and receive messages from an arbitrarily large number of other computers.

4.2 Monitoring Communication
Cryptography can provide tremendously robust encryption, so it can impede the government's efforts to legitimately perform electronic reconnaissance. In order to meet this need, the key is escrowed via an entrusted third party. This technology allows the use of strong encryption, but also allows the government, when legally authorized, to obtain the decryption keys held by escrow agents. NIST has published the Escrowed Encryption Standard as FIPS 185.

4.3 Fractional Observing of Data
Sometimes a sender wants only part of the messages to be monitorable, but not all. In that case translucent cryptography is used, which explores the space between opaque (strong encryption with no key escrow) and transparent (no encryption, or encryption with key escrow). With the translucent scheme, the government can decrypt some of the messages, but not all. Just as a translucent door on a shower stall provides some privacy, but not perfect privacy, translucent cryptography provides some communications privacy, but not perfect privacy. In this scheme the degree of translucency can be controlled by varying the parameter p.
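A toy Python simulation of the effect of the parameter p follows. The per-message coin flip is our simplification: actual translucent schemes fix the recoverable fraction non-interactively rather than by random choice, so this only illustrates how p controls the recoverable share:

    import random

    def send_translucent(messages, p):
        # Tag roughly a fraction p of messages as recoverable by the
        # authority; the remaining 1-p stay opaque.
        return [(m, random.random() < p) for m in messages]

    sent = send_translucent(range(10000), p=0.25)
    share = sum(1 for _, recoverable in sent if recoverable) / len(sent)
    print(share)   # close to 0.25: about a quarter of the traffic is readable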
4.4 Transferring Files on Network

Files that are to be exchanged between users need to be protected against malicious users and attackers. Symmetric key cryptography uses a single key for both encryption and decryption.
In this technology, the file is encrypted with a symmetric key, and that symmetric key is in turn encrypted with the public key associated with the receiver of the file; the encrypted file and the wrapped key are then sent to the receiver. To decrypt the file, the encrypted file system component driver uses the private key associated with the receiver to decrypt the symmetric key that was used to encrypt the file, and then uses the symmetric key to decrypt the file itself.
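The flow just described is hybrid (envelope) encryption. The sketch below illustrates it with the Python cryptography package; the specific choices of RSA-OAEP for key wrapping and Fernet for the file body are ours, not prescribed by the text.

```python
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes
from cryptography.fernet import Fernet

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

# Receiver's key pair (in practice generated once and stored securely).
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

# Sender: encrypt the file with a fresh symmetric key,
# then wrap that key with the receiver's public key.
sym_key = Fernet.generate_key()
ciphertext = Fernet(sym_key).encrypt(b"contents of the file")
wrapped_key = public_key.encrypt(sym_key, oaep)

# Receiver: unwrap the symmetric key, then decrypt the file.
recovered_key = private_key.decrypt(wrapped_key, oaep)
plaintext = Fernet(recovered_key).decrypt(ciphertext)
assert plaintext == b"contents of the file"
```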

5. CONCLUSION
In this research, the limitations and applications of different types of cryptography (symmetric key cryptography, asymmetric key cryptography, hash functions, key escrow cryptography, and translucent cryptography) are analysed. Public-key cryptography provides increased security and convenience: private keys never need to be transmitted or revealed to anyone. In a secret-key system, by contrast, the secret keys must be transmitted, and there is a chance that an enemy can discover the secret keys during their transmission. Hash tables can be more difficult to implement than self-balancing binary search trees. In key escrow cryptography a third party is involved in the encryption; translucent cryptography overcomes this drawback of key escrow cryptography, but it also has some limitations. With these limitations addressed, these algorithms can be used in various fields of our life for safe data transfer.

6. REFERENCES
[1] Shivangi Goyal, "A Survey on the Applications of Cryptography", International Journal of Science and Technology, Volume 1, No. 3, March 2012.
[2] Vishwa Gupta, Gajendra Singh, Ravindra Guptha, "Advance cryptography algorithm for improving data security", IJARCSSE, Volume 2, Issue 1, January 2012.
[3] Mohsin Khan, Sadaf Hussain, Malik Imran, "Performance Evaluation of Symmetric Cryptography Algorithms: A Survey", International Journal of Information Technology and Electrical Engineering, Volume 2, Issue 2, April 2013.
[4] National Bureau of Standards, "Data Encryption Standard", FIPS Publication 46, 1977.
[5] William Stallings, "Cryptography and Network Security", 4th edition, Prentice Hall, November 18, 2006.
[6] Diaa Salma and Hateem Muhammad, "Evaluating The Performance of Symmetric Encryption Algorithms", International Journal of Network Security.
[7] Monika Agrawal, Pradeep Mishra, "A Comparative Survey on Symmetric Key Encryption Techniques", International Journal on Computer Science and Engineering (IJCSE), Vol. 4, No. 05, May 2012.
[8] KetuFile white paper, "Symmetric vs Asymmetric Encryption", a division of Midwest Research Corporation.


A Novel Scheme for Handwritten Document using Image Segmentation

Dr. S. Pannirselvam (Research Supervisor & Head), S. Prasath (Assistant Professor), R. Keerthana (M.Phil Research Scholar)
Department of Computer Science, Erode Arts & Science College, Erode, Tamilnadu, India.
pannirselvam08@gmail.com, softprasaths@gmail.com, keerthanarcs@gmail.com

ABSTRACT

Image segmentation plays a very necessary role in image processing. Segmentation subdivides an image into its constituent regions, which can be obtained by numerous methods, some of which are easier to understand than others. Edge detection techniques are widely used to find objects based on local changes in intensity in image processing applications. This paper studies a new scheme for text image segmentation in which an English handwritten document is divided into numerous regions based on line segmentation, word segmentation and character segmentation. The experimental results show that the proposed methodology gives higher accuracy for handwritten document segmentation.

Keywords
Image segmentation, Histogram based Algorithm, Edge Detection algorithm, Preprocessing, Image acquisition.

1. INTRODUCTION
Character recognition methods comprise extracting the features of the script and of individual characters, and identifying characters based on those features. The difficulties faced by these systems are variations in writing styles, heights and widths of characters, and slanting angles. The major techniques used for offline character recognition are neural networks and support vector machines. The problem with these approaches grows with the size of the training data; to overcome this drawback, an optimization technique is needed that decreases the running time of the classifier.

2. RELATED WORK
Segmentation algorithms are area oriented instead of pixel oriented. The result of segmentation is the splitting up of the image into connected areas; segmentation is thus concerned with dividing an image into meaningful regions. Image segmentation can be broadly classified into local and global segmentation [1].
A study of image segmentation and edge detection techniques was presented by Punam Thakare, discussing image segmentation techniques such as edge based, region based and integrated techniques. The results show that the recognition rate depends on the type of the image and its ground truths [3].
A methodology for extracting text from images such as document images and scene images was proposed by Neha Gupta and V.K. Banga. This work employs the discrete wavelet transform for extracting text information from complex images. The input image may be a colour image or a grayscale image; if it is a colour image, preprocessing is required. For extracting text edges, the Sobel edge detector is applied on each sub-image, and the resultant edges are used to form an edge map. Morphological operations are applied on the processed edge map, and further thresholding is applied to improve the performance [4].
Sunil Kumar et al. [5] proposed globally matched wavelet filters with Fisher classifiers for text extraction from document images and scene text images.
A Gabor function based filter bank has been used to separate text and non-text areas of comparable size, and [6] applied Delaunay triangulation for the extraction of text areas in a document page by representing the locations of connected components in a document image by their centroids.

3. METHODOLOGY
In this paper the segmentation is classified into line segmentation, which identifies the lines in the document; word segmentation, which identifies the words in the document; and character segmentation, which identifies the characters in the document. The goal of the segmentation is to simplify or change the representation of the image into something that is more meaningful and easier to analyze.

3.1 Compression Based Algorithm
Compression based algorithms postulate that the optimal segmentation is the one that minimizes, over all possible segmentations, the coding length of the data. The connection between these two concepts is that segmentation tries to find patterns in an image, and any regularity in the image can be used to compress it.

3.2 Histogram based Algorithm
Histogram based methods are very efficient compared to other image segmentation methods because they typically require only one pass through the pixels. In this technique, a histogram is computed from all the pixels in the image, and the peaks and valleys in the histogram are used to locate the clusters in the image; intensity can be used as the measure. A refinement of this technique is to recursively apply the histogram-seeking method to the clusters in the image in order to divide them into smaller clusters, repeating with smaller and smaller clusters until no more clusters are formed. One disadvantage of the histogram-seeking method is that it may be difficult to identify significant peaks and valleys in the image.
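As a minimal illustration of the histogram-seeking idea, the sketch below thresholds a grayscale image at the deepest valley between the two largest peaks of its histogram; the smoothing window and the valley-picking rule are our simplifications, not part of the paper.

```python
import numpy as np

def histogram_threshold(gray: np.ndarray) -> int:
    """Threshold a uint8 image at the deepest valley between the two
    highest peaks of its 256-bin gray-level histogram (simplified)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    smooth = np.convolve(hist, np.ones(5) / 5, mode="same")  # light smoothing
    peaks = [i for i in range(1, 255)
             if smooth[i] >= smooth[i - 1] and smooth[i] >= smooth[i + 1]]
    lo, hi = sorted(sorted(peaks, key=lambda i: smooth[i])[-2:])
    return lo + int(np.argmin(smooth[lo:hi + 1]))  # valley between the peaks

# usage: foreground = gray > histogram_threshold(gray)
```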


3.3 Edge Detection Algorithm
Edge detection is a well developed field on its own within image processing. Region boundaries and edges are closely related, since there is often a sharp adjustment in intensity at the region boundaries. Edge detection techniques have therefore been used as the base of another segmentation technique. The edges identified by edge detection are often disconnected; to segment an object from an image, however, one needs closed region boundaries.
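Since the edge maps discussed here are typically produced with gradient operators such as the Sobel detector mentioned in Section 2, a small numpy sketch follows; the threshold value is an arbitrary choice for illustration.

```python
import numpy as np

def sobel_edges(gray: np.ndarray, thresh: float = 100.0) -> np.ndarray:
    """Return a binary edge map from the Sobel gradient magnitude."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    g = gray.astype(float)

    def corr2(img, k):
        # valid 3x3 cross-correlation via shifted sums (no SciPy needed)
        out = np.zeros((img.shape[0] - 2, img.shape[1] - 2))
        for i in range(3):
            for j in range(3):
                out += k[i, j] * img[i:i + out.shape[0], j:j + out.shape[1]]
        return out

    gx, gy = corr2(g, kx), corr2(g, ky)
    return (np.hypot(gx, gy) > thresh).astype(np.uint8)
```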

3.4 Region based methods
This approach uses the properties of the colour or gray scale in a text region, or their differences with respect to the background. The method is basically divided into two sub-categories: edge based and connected component (CC) based methods. The edge based methods mainly focus on the high contrast between text and background: text edges are first identified in the image and merged, and finally some heuristic rules are applied to discard non-text regions. Connected component based methods consider text as a set of separate connected components, each having distinct intensity and colour distributions. The edge based methods are robust to low contrast and different text sizes, whereas CC based methods are somewhat simpler to implement, but they fail to localize text in images with complex backgrounds.

3.5 Steps in Image segmentation
The general steps that are involved in image segmentation systems are to read the image, segment it, and then produce the line, word and character segments.

Fig.1 Steps in image segmentation (Read Image, Segmentation, then Line / Word / Character Segments)

4. Experimental Results
The proposed method is experimented with images collected from our own handwritten database consisting of 1000 images. Table 1 below shows that the recognition percentage of the document images gives higher retrieval accuracy.

4.1 Sample Image
The sample images were captured and are shown in fig.2 (a. Alphabets, b. Symbols, c. Function Keys, d. Numbers).

Fig.2 Sample Handwritten text

4.2 Segmented Image
Image segmentation is the process of partitioning a digital image into multiple segments. The images are segmented into lines, words and characters for the given preprocessed input image.

4.2.1 Line segmentation
The images are segmented row-wise (line segmentation). The resulting images of the line segmentation (a. Capital Alphabets, b. Symbols, c. Function Keys, d. Numbers) are shown in fig.3.

Fig.3 Line Segmented image

4.2.2 Word segmentation
In the line segmented image, each word is segmented. Fig.4 shows the word segmentation (a. Capital Alphabets, b. Small Alphabets).


Fig.4 Word Segmented image

4.2.3 Character Extraction

The characters extracted from the words in the captured images are shown in fig.5 (a. Capital Alphabets, b. Small Alphabets).

Fig.5 Characters Extracted

5. Conclusion
The proposed image segmentation method has been tested on a number of document images and handwritten images, using a set of quantitative evaluation measurements for the image segmentation. The system is designed in such a way that the text in a document image is detected and segmented automatically. Line segmentation is done by using horizontal projection profile and vertical projection profile analysis.
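As the conclusion notes, line segmentation is driven by projection profile analysis. A small numpy sketch of row-wise (horizontal profile) line splitting follows; the ink threshold is an assumption.

```python
import numpy as np

def segment_lines(binary: np.ndarray, min_ink: int = 1):
    """Split a binarized page (text pixels = 1) into text lines using the
    horizontal projection profile: rows with fewer than min_ink text
    pixels are treated as gaps between lines."""
    profile = binary.sum(axis=1)            # ink count per row
    rows = profile >= min_ink
    lines, start = [], None
    for r, has_ink in enumerate(rows):
        if has_ink and start is None:
            start = r                        # a text line begins
        elif not has_ink and start is not None:
            lines.append(binary[start:r])    # a text line ends
            start = None
    if start is not None:
        lines.append(binary[start:])
    return lines

# Word segmentation works the same way within each line, using the
# vertical profile: line.sum(axis=0).
```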

6. References
1. Igor Tchouchenkov, Heinz Wörn, "Optical Character Recognition Using Optimisation Algorithms", Proceedings of the 9th International Workshop on Computer Science and Information Technologies CSIT'2007, Ufa, Russia, 2007.
2. Mantas Paulinas and Andrius Usinskas, "A Survey of Genetic Algorithms Applications for Image Enhancement and Segmentation", Information Technology and Control, Vol.36, No.3, pp.278-284, 2007.
3. Punam Thakare, "A Study of Image Segmentation and Edge Detection Techniques", International Journal on Computer Science and Engineering, Vol. 3, No. 2, 2011.
4. Neha Gupta, V.K. Banga, "Image Segmentation for Text Extraction", ICEECE'2012, 2012.
5. Sunil Kumar, Rajat Gupta, Nitin Khanna, Santanu Chaudhury, and Shiv Dutt Joshi, "Text Extraction and Document Image Segmentation Using Matched Wavelets and MRF Model", IEEE Transactions On Image Processing, Vol.16, No.8, pp.2117-2128, 2007.
6. Yi Xiao, Hong Yan, "Text region extraction in Document image based on the Delaunay tessellation", The Journal of Pattern Recognition Society, 2003.
7. H. Tran, A. Lux, H. L. Nguyen, and A. Boucher, "A novel approach for text detection in images using structural features", in Proc. 3rd Int. Conf. Adv. Pattern Recognition, pp. 627-635, 2005.
8. Q. Liu, C. Jung, and Y. Moon, "Text segmentation based on stroke filter", in Proc. Int. Conf. Multimedia, pp. 129-132, 2006.


A Selective Analysis of Imputation Methods for Collaborative Filtering System

M. Suguna (Research Scholar), Dr. P. Raajan (Assistant Professor)
Ayya Nadar Janaki Ammal College, Sivakasi
shivahemaprithi@gmail.com, raajanp99@gmail.com

ABSTRACT
Due to the rapid development of information technology, online business has emerged and product sales are accomplished through various websites. Recommender systems help users to select valuable products. In such systems, sparsity is the major problem. In this paper, we propose a methodology to preprocess the data; two methods are used to impute the missing values, and finally we select the better method based on time complexity.

General Terms
Imputation, Row elimination, Column elimination

Keywords
Collaborative Filtering (CF), Singular Value Decomposition (SVD), Principal Component Analysis (PCA).

1. INTRODUCTION
Information Technology is the application of computers and telecommunications equipment to transmit and manipulate data, often in the context of a business or other enterprise.
Data Mining (DM) is an analytic process designed to extract useful information from large amounts of data. It includes various techniques such as preprocessing, supervised learning methods and unsupervised learning methods.
A recommender system is a system that provides suggestions to users about what product to choose. Mainly there are three types of recommender systems: content based, collaborative filtering (CF) and hybrid systems. Collaborative filtering systems can be categorized into user-based and item-based.
In a user-based CF system, the ratings given by the users are represented as an (M x N) user-item matrix whose elements are the ratings given by M users for the corresponding N items. Usually the user-item matrix is a sparse matrix. This paper is organized as follows. Section 2 discusses the related works in detail. Section 3 discusses the data format, row elimination, column elimination and imputation of missing values. Section 4 discusses the imputation methods in detail and analyzes them. Section 5 presents the pseudo code for the preprocessing technique. Section 6 discusses the experimental results. Section 7 concludes the analysis. Section 8 specifies the future enhancement.

2. RELATED WORKS
Nilashi M et al. [2] presented the basic concepts such as data format, CF techniques, CF tasks such as prediction and recommendation, similarity measures, and the evaluation metrics used in CF systems.
Billsus D et al. [3] specified that since the correlation coefficient is only defined between customers who have rated at least two products in common, many pairs of customers have no correlation at all. Su et al. [4] used PMM (predictive mean matching, a state-of-the-art imputation technique) to impute the missing values.
Ghazanfar M A et al. [5] proposed nearly 17 approaches to approximate the missing values in the user-item rating matrix. The main approaches are based on various classifiers such as k-nearest neighbor, Naïve Bayes and support vector machines.
Raiko T et al. [6] proposed a gradient descent algorithm inspired by Oja's rule, and sped it up by an approximate Newton's method. Noha H et al. [7] proposed multiple imputation based collaborative filtering to solve the sparsity problem.
Chen Y et al. [8] solved the sparsity problem in recommender systems by retrieving associations between users.
Kumar B D et al. [9] explained the role of matrix factorization methods such as SVD, PCA and probabilistic matrix factorization in recommender systems.

3. METHODOLOGY
3.1 Preprocessing Framework
Preprocessing is one of the important data mining tasks. As the movie dataset is a sparse dataset, it is better to preprocess the data. The proposed preprocessing technique involves three steps, as shown in figure 1.

Fig 1: Steps involved in Preprocessing Technique


3.2 Row And Column Elimination
Row elimination is a process used to eliminate the users who have rated fewer than 100 movies out of a total of 1682 movies. Column elimination is a process used to eliminate the movies that have been rated by fewer than 50 users out of a total of 943 users.

3.3 Imputation of Missing Values
The movie dataset contains 1,486,126 (93.7%) missing values. Using such a data set for further data mining techniques such as classification, clustering and association rule mining leads to poor results, so to improve the accuracy we impute the missing values. The missing values can be filled in by various methods such as singular value decomposition and principal component analysis.

3.4 IMPUTATION METHODS
3.4.1 Singular Value Decomposition (SVD)
Researchers usually use SVD for dimensionality reduction, but SVD can also be used to impute the missing values in the dataset of a CF system.
The core of the SVD algorithm lies in the following theorem: it is always possible to decompose a given matrix A into A = UΣV^T. Given the n×m data matrix A (n items, m features), we can obtain an n×r matrix U (n items, r concepts), an r×r diagonal matrix Σ (strength of each concept), and an m×r matrix V (m features, r concepts). The diagonal matrix Σ contains the singular values, which are always positive and sorted in decreasing order. The U matrix is interpreted as the item-to-concept similarity matrix, while the V matrix is the term-to-concept similarity matrix. The imputed data will be present in U as shown in figure 2.

Fig 2: SVD for Imputation
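A compact numpy sketch of SVD-based imputation in the spirit of Section 3.4.1: missing cells are first filled with column means (our initialization choice), a truncated decomposition A ≈ UΣV^T is taken, and the missing entries are read off the low-rank reconstruction. The rank r is a tunable assumption.

```python
import numpy as np

def svd_impute(R: np.ndarray, r: int = 10) -> np.ndarray:
    """Impute NaN entries of a user-item rating matrix with a rank-r
    SVD reconstruction."""
    A = R.copy()
    mask = np.isnan(A)
    col_means = np.nanmean(A, axis=0)          # neutral starting values
    A[mask] = col_means[np.where(mask)[1]]
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_r = (U[:, :r] * s[:r]) @ Vt[:r]          # truncated reconstruction
    out = R.copy()
    out[mask] = A_r[mask]                      # keep observed ratings as-is
    return out
```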
3.4.2 Principal Component Analysis (PCA)
Principal component analysis was first introduced in 1901 by Karl Pearson [10]. PCA allows us to retrieve the original data matrix by projecting the data onto the new coordinate system:

    X'(n×m) = X(n×m) W(m×m)

The new data matrix X' contains most of the information of the original X, with the imputed values.

3.5 ALGORITHM FOR PREPROCESSING TECHNIQUE
The algorithm for the preprocessing technique used in this paper is as follows:

Input: Dataset with missing values
Output: Dataset with imputed values
Procedure:
Step 1: Eliminate rows with ratings < 100
Step 2: Eliminate columns with ratings < 50
Step 3: Perform singular value decomposition to impute the missing values and solve the sparsity problem

4. EXPERIMENTAL RESULT
4.1 Data Source
The MovieLens 100k dataset consists of 100,000 ratings (1-5) from 943 users on 1682 movies [1]. The dataset consists of three files: users, movies and ratings. As our research is related to ratings, we consider the ratings file alone; it consists of four attributes (User id, Movie id, Ratings, Timestamp). As the timestamp is the same for most records, we ignored that attribute. A sample of the data set is given in table 1 (blank cells denote missing ratings).

Table 1. Sample Movie Dataset

        Movie1  Movie2  Movie3  Movie4
User1   1       3
User2   5
User3   3       2
User4   3       1
User5   2       3

4.2 Result Of Row Elimination and Column Elimination
The original data size is 4.75 MB. Table 2 shows the number of users and movies, the size of the data after row elimination and column elimination, and the time taken for each process.

Table 2. Row And Column Elimination

Task                 No. of Users  No. of Movies  Data Size  Time Taken (Msec)
Row Elimination      361           1682           1.92MB     0.5460
Column Elimination   361           593            701KB      0.0312


4.3 Result Of Imputation Using SVD
The results of the imputation methods SVD and PCA are given in table 3.

Table 3. Imputation result

Imputation Method  Time Taken (Msec)  Remarks
SVD                0.69               Efficient
PCA                130.60             Bad in time taken; increases the number of columns

Among the two imputation methods considered, SVD and PCA, singular value decomposition is the better method because it takes less time and neither reduces nor increases the dimension. In this paper, we concentrated on the sparsity problem in CF systems. The sparsity has been reduced using the preprocessing technique presented in this paper, as shown in figure 3.

Fig 3: Levels Of Sparsity

5. CONCLUSION
In this paper we discussed the preprocessing framework, which consists of three tasks: row elimination, column elimination and imputation of missing values. We also compared two imputation methods, SVD and PCA, and selected SVD as the better method to impute missing values. We conclude that the sparsity problem in a CF system can be solved by using this framework.

6. FUTURE ENHANCEMENT
In future, this preprocessed movie dataset can be used for the prediction of ratings for new users and the recommendation of movies, which should lead to good accuracy of prediction and recommendation. This preprocessing technique can also be applied to other data sets such as jokes, books etc.

7. REFERENCES
[1] http://www.movielens.org
[2] Nilashi, M., Bagherifard, K., Ibrahim, O., Alizadeh, H., Nojeem, L.A. and Roozegar, N., 2013. Collaborative Filtering Recommender Systems. Research Journal of Applied Sciences, Engineering and Technology, Vol. 5, No. 16, pp. 4168-4182.
[3] Billsus, D., and Pazzani, M. J. 1998. Learning Collaborative Information Filters. In Proceedings of Recommender Systems Workshop, Tech. Report WS-98-08, AAAI Press.
[4] Su, X., Khoshgoftaar, T. M., Zhu, X., and Greiner, R. 2008. Imputation-boosted collaborative filtering using machine learning classifiers. In Proceedings of the 2008 ACM Symposium on Applied Computing (pp. 949-950). ACM.
[5] Ghazanfar, M. A., and Prügel-Bennett, A. 2013. The Advantage of Careful Imputation Sources in Sparse Data-Environment of Recommender Systems: Generating Improved SVD-based Recommendations. Informatica (Slovenia), 37(1), 61-92.
[6] Raiko, T., Ilin, A., and Karhunen, J. 2008. Principal component analysis for sparse high-dimensional data. In Neural Information Processing (pp. 566-575). Springer Berlin Heidelberg.
[7] Noha, H., Kwak, M., and Hanc, I. 2003. Handling incomplete data problem in collaborative filtering system. Journal of Intelligent Information Systems, 9(2), 51.
[8] Chen, Y., Wu, C., Xie, M., and Guo, X. 2011. Solving the Sparsity Problem in Recommender Systems Using Association Retrieval. Journal of Computers, 6(9), 1896-1902.
[9] Kumar Bokde, D., Girase, S., and Mukhopadhyay, D. 2014. Role of Matrix Factorization Model in Collaborative Filtering Algorithm: A Survey. International Journal of Advance Foundation and Research in Computer (IJAFRC), Volume 1, Issue 12, ISSN 2348-4853.
[10] Pearson, K. 1901. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11), 559-572.


Discretization in Mining using Binning Method

Dr. K. Meenakshisundaram (Associate Professor), C. Veerakumar (Ph.D Research Scholar), M. Menaka (Ph.D Research Scholar)
Department of Computer Science, Erode Arts and Science College, Erode.
lecturerkms@yahoo.com, veerakumarct@gmail.com, menarameac@gmail.com

ABSTRACT
Data mining is the non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data; it is defined as the automatic extraction of unknown, useful and understandable patterns from large databases. Utilizing these massive volumes of data is becoming a major problem for all enterprises. Raw data is highly susceptible to noise, missing values and inconsistency, and the quality of the data affects the data mining results. In order to help improve the quality of the data and, consequently, of the mining results, raw data is pre-processed so as to improve the efficiency of mining. Noise is a random error or variance in a measured variable, and it is removed using data smoothing techniques. This paper provides a study of data smoothing techniques that remove noise by using the binning method in data mining.

Keywords
Data Mining Techniques, Data pre-processing, Discretization, Binning method

1. INTRODUCTION
The development of Information Technology has generated large numbers of databases and huge amounts of data in various areas. The research in databases and information technology has given rise to an approach to store and manipulate this precious data for further decision making. Data mining is a process of extraction of useful information and patterns from huge data [3]. It is also called the knowledge discovery process, knowledge mining from data, knowledge extraction or data/pattern analysis.

Fig.1 Architecture of Data Mining

2. DATA PREPROCESSING
Data pre-processing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviours or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues. Real world databases are highly susceptible to noisy, missing and inconsistent data due to their huge size and multiple sources, and low quality data will lead to low-quality mining results. So, we prefer a preprocessing concept:

Fig 2. Forms of Data Pre-processing

3. TASKS IN DATA PREPROCESSING
3.1 Data Cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
Data cleaning tasks are:
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data
- Resolve redundancy caused by data integration

3.2 Data Integration
Integration of multiple databases, data cubes, or files. Schema integration: integrate metadata from different sources. Entity identification problem: identify real world entities from multiple data sources, e.g., A.cust-id and B.cust-#. Detecting and resolving data value conflicts: for the same real world entity, attribute values from different sources may differ; possible reasons are different representations and different scales. Handling redundancy in data integration: redundant data occur often when multiple databases are integrated, since the same attribute may have different names in different databases. Careful integration of the data from multiple sources may help


reduce/avoid redundancies and inconsistencies and improve mining speed and quality.

3.3 Data Reduction
Data reduction techniques can be applied to obtain a reduced representation of the data that is much smaller in volume but produces the same or similar analytical results.
Data reduction strategies:
- Data cube aggregation
- Attribute subset selection
- Dimensionality reduction
- Numerosity reduction

3.4 Data Transformation
In data transformation, the data are transformed or consolidated into forms appropriate for mining.
Data transformation can involve the following:
- Smoothing
- Aggregation
- Generalization of the data
- Normalization
- Attribute construction

4. DATA CLEANING
Real world data tend to be incomplete, noisy and inconsistent. Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.

4.1 Missing value
Many tuples have no recorded values for several attributes, such as customer income, so we may need to fill in the missing values for such attributes. The main options, sketched in the code example after this list, are:

a) Ignore the tuple
This method is not very effective unless the tuple contains several attributes with missing values.

b) Fill in the missing value manually
This approach is time-consuming and may not be feasible given a large data set with many missing values.

c) Use a global constant to fill in the missing value
Replace all missing attribute values by the same constant, such as a label like UNKNOWN or minus infinity.
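The tuple-dropping and constant-fill strategies above map directly onto standard pandas calls; a brief sketch with a hypothetical income column follows (the manual strategy (b) has no code equivalent).

```python
import pandas as pd

# hypothetical attribute with missing customer incomes
df = pd.DataFrame({"income": [4200.0, None, 3100.0, None]})

dropped   = df.dropna()                        # (a) ignore the tuple
constant  = df.fillna("UNKNOWN")               # (c) global constant label
mean_fill = df.fillna(df["income"].mean())     # variant: fill with the mean
```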
4.2 Noisy Data
Noise is a random error or variance in a measured variable. Noise is removed using data smoothing techniques.

4.3 Inconsistent value
Inconsistencies exist in the data stored in transactions. Inconsistencies occur due to errors during data entry, functional dependencies between attributes, and missing values.

5. HANDLING NOISY DATA
5.1 Binning method
First sort the data and partition it into (equi-depth) bins. Then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.

5.2 Clustering
Outliers may be detected by clustering, where similar values are organized into groups, or clusters. Intuitively, values that fall outside of the set of clusters may be considered outliers.

5.3 Combined computer and human inspection
Detect suspicious values and check them by human inspection (e.g., deal with possible outliers).

5.4 Regression
Data can be smoothed by fitting the data to a function, such as with regression. Linear regression involves finding the best line to fit two attributes (or variables), so that one attribute can be used to predict the other. Multiple linear regression is an extension of linear regression where more than two attributes are involved and the data are fit to a multidimensional surface.

6. IMPACT OF BINNING METHOD IN DISCRETIZATION
Discretization of continuous attributes is one of the important data pre-processing steps of knowledge extraction. An effective discretization method not only reduces the demand on system memory and improves the efficiency of data mining and machine learning algorithms, but also makes the knowledge extracted from the discretized dataset more compact and easier to understand and use.

6.1 Features of Data Discretization
- Leads to a concise representation.
- Easy to use.
- Knowledge-level representation of mining results.

6.2 The Role of Binning
Binning is a top-down splitting technique based on a specified number of bins. Binning methods are also used as discretization methods for numerosity reduction, and these techniques can be applied recursively to the resulting partitions. Binning does not use class information.

6.2.1 Equal-width (distance) partitioning
Divides the range into N intervals of equal size (uniform grid). If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N. This is the most straightforward approach, but outliers may dominate the presentation, and skewed data is not handled well.

6.2.2 Equal-depth (frequency) partitioning
Divides the range into N intervals, each containing approximately the same number of samples. This gives good data scaling, but managing categorical attributes can be tricky. The equal-depth scheme and the smoothing modes of Section 5.1 are illustrated in the sketch below.
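The sketch below implements equal-depth binning with smoothing by bin means and by bin boundaries; the sample values follow the common textbook-style example, and the bin count N is a parameter.

```python
import numpy as np

def equal_depth_smooth(values, n_bins=3, mode="means"):
    """Sort, partition into equi-depth bins, then smooth each bin by its
    mean or by its nearest boundary (Sections 5.1 and 6.2.2)."""
    x = np.sort(np.asarray(values, dtype=float))
    out = []
    for b in np.array_split(x, n_bins):
        if mode == "means":
            out.extend([b.mean()] * len(b))
        else:  # "boundaries": snap each value to the nearer bin edge
            out.extend(b.min() if v - b.min() <= b.max() - v else b.max()
                       for v in b)
    return out

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(equal_depth_smooth(prices, 3, "means"))        # bin means 9, 22, 29
print(equal_depth_smooth(prices, 3, "boundaries"))   # 4,4,15 | 21,21,24 | 25,25,34
```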


6.2.3 Binning methods for Data Smoothing
These methods smooth a sorted data value by consulting its neighbourhood, that is, the values around it.

7. CONCLUSION
Data preparation is an important issue for both data warehousing and data mining, as real world data tends to be incomplete, noisy and inconsistent. Data preparation includes data cleaning, data integration, data transformation and data reduction. Data cleaning routines can be used to fill in missing values, smooth noisy data, identify outliers and correct data inconsistencies. A lot of methods have been developed, but this is still an active area of research.

8. REFERENCES
[1]. CRISP-DM 1.0: Step-by-step Data Mining Guide, from http://www.crisp-dm.org/CRISPWP-0800.pdf
[2]. Gregory Piatetsky-Shapiro, KDnuggets, http://www.kdnuggets.com/data mining concepts/
[3]. Gupta, G.K., "Introduction to Data Mining with Case Studies".
[4]. Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann, 2nd Ed., 2006.
[5]. Liu, H. and Setiono, R., "Feature Selection and Discretization", IEEE Transactions on Knowledge and Data Engineering, 9:1-4, 1997.
[6]. Pyle, D., "Data Preparation for Data Mining", Morgan Kaufmann, 1999.
[7]. Richard J. Roiger and Michael W. Geatz, "Data Mining: A Tutorial-based Primer".
[8]. Tan Pang-Ning, Steinbach, M., Vipin Kumar, "Introduction to Data Mining", Pearson Education, New Delhi, ISBN: 978-81-317-1472-0, 3rd Edition, 2009.


A Content based Image Retrieval with Shape Feature Extraction

Dr. S. Pannirselvam (Department of Computer Science, Erode Arts & Science College, Erode), R. N. Muhammad Ilyas (Ph.D Research Scholar, M.S. University, Tirunelveli)
pannirselvam08@gmail.com, ilyas_rn@yahoo.com

ABSTRACT

Content based image retrieval (CBIR) uses the visual contents of images, with low level features like colour, shape, texture and spatial layout, to represent and index the image. CBIR is a technique used to search images in large scale image databases. CBIR methods suffer from the problem of high dimensionality, leading to more computational time and inefficient indexing and retrieval performance. This paper focuses on a low-dimensional shape based indexing technique for achieving efficient and effective retrieval performance: a simple index based on shape features of square regions that are segmented out of images. Each region within an image is then indexed by a region-based shape index, which is invariant to translation, rotation and scaling. The retrieval performance is compared with a region-based shape-indexing scheme.

1. INTRODUCTION
The past few years have seen many advanced techniques evolving in Content-Based Image Retrieval (CBIR). Applications like medicine, entertainment, education, manufacturing, etc. make use of vast amounts of visual data in the form of images. This envisages the need for fast and effective retrieval mechanisms. A major approach directed towards achieving this goal is the use of low-level features of the image data to segment, index and retrieve relevant images from the image database.

2. RELATED WORK
Gulfishan et al. [GUL11] proposed that semantic-based image retrieval is a better way to solve the semantic gap problem compared with the many image retrieval techniques based on colour, texture, shape and semantics, and that retrieval results improve with semantics based retrieval.
Niket et al. [NIK13] proposed a Region Based Image Retrieval (RBIR) system that uses the Discrete Wavelet Transform (DWT) and the k-means clustering algorithm to segment an image into regions. Each region of an image is represented by a set of optical characteristics, and the similarity between regions is measured using a particular function.
Abby A. [ABB00] presented cross-disciplinary approaches utilizing both text and image features for retrieval.
K. Konstantinidis et al. [KON05] proposed a new fuzzy linking method of colour histogram creation based on colour space, assessed by its performance in retrieving similar images from an image collection.
Yong et al. [YON99] presented visual feature extraction, multidimensional indexing and system design achieved by using CBIR.
S. Arivazhagan et al. [ARI07] proposed a contour based approach. The Canny operator is used to detect the edge points of an image; the contour of the image is traced by scanning the edge image, and re-sampling is done to avoid discontinuities in the contour representation. This approach has proved to be simple and effective for shape based image retrieval using salient points.
Anju et al. [ANJ14] developed a novel technique for image recovery based on integrating contour, texture, colour, edge and spatial features. Contourlet decomposition is used to extract the contour features; the directionality and anisotropy properties of the contourlet transform make the technique efficient.
M. Risvana et al. [RIS14] enhanced a novel approach using the Local Tetra Pattern along the vertical, horizontal and diagonal directions, together with colour features. The Local Tetra Pattern encodes images based on the direction of pixels, calculated from horizontal, vertical and diagonal derivatives.
Simily et al. [SIM11] proposed multiple image queries, connected using a logical AND operation, for finding desired images from a database. Local Binary Pattern texture descriptors of the query images are extracted, and the features are compared with those of the images in the database to find the desired images.
Reshma et al. [RES12] proposed an algorithm using the Colour Coherence Vector in which the speed of shape based retrieval is enhanced by considering approximate shape.
Sonya et al. [SON13] proposed a feature extraction technique based on the smallest rectangle that covers a shape in an image. It is invariant against translation, scaling and rotation.
Mehwish et al. [MEH12] discussed the visual contents of an image in terms of low level features extracted from the image, namely colour, texture and shape, together with algorithms for feature extraction and relevance feedback used to bridge the semantic gap between the extracted low level features and high level semantics.
B.G. Prasad et al. [PRA03] developed a technique to retrieve images by region matching using a combined feature index based on colour, shape and location, presented within the MPEG-7 framework. Dominant regions within each image are indexed using integrated colour, shape and location features.
Cheng et al. [CHE13] presented an image retrieval method based on region shape similarity. This approach first segments images into primitive regions and then combines some primitive regions to generate meaningful composite shapes, which are used as semantic units of the images in the similarity assessment.
Sharadh Ramaswamy et al. [SHA09] described a fast clustering-based indexing technique in which relevant clusters are retrieved until the exact nearest neighbors are found. It enables efficient clustering with low preprocessing storage.
Stricker et al. [MSO95] proposed two new colour indexing techniques. The first is a more robust version of the commonly used colour histogram indexing, with an index storing cumulative colour histograms; this method is robust with respect to the quantization parameter of the histograms. The second approach stores only dominant features instead of complete colour distributions.


Ying Liu et al. [YLI07] surveyed methods to improve the retrieval accuracy of content-based image retrieval systems, from low-level feature extraction algorithms to techniques that reduce the semantic gap between the visual features and human semantics.

3. FEATURE EXTRACTION
The characteristics of the features of objects represent the relevant information of an image for a complete characterization. Feature extraction analyses objects and images to extract the prominent features that are representative of the various classes of objects; the extracted features are then used as input to classifiers.

3.1 SMALLEST RECTANGLE CENTROID DISTANCE (SRCD) SIGNATURE
A new shape signature is proposed based on the SRD signature. First, the smallest rectangle that covers the shape is identified. Then p points on each length side and q points on each width side of the rectangle are considered, with the selected points at the same distance from each other. For each of those 2(p + q) selected points on the rectangle's boundary, two values should be obtained:
1. The real part: the distance between a rectangle point (xr, yr) and the opposite point on the shape's boundary (xt, yt), called dRectangle.
2. The imaginary part: the distance between the point (xt, yt) and the centroid of the shape (xc, yc), called dCentroid.
In the SRD signature only dRectangle is used for feature extraction; here the feature vector of the smallest rectangle centroid distance (SRCD) is proposed as

    s(i) = dRectangle(i) + j * dCentroid(i),   i = 1, 2, ..., 2(p + q)

Because the rectangle covers the shape, the SRCD signature is not sensitive to translation. The SRCD signature is invariant to scale when the feature vector elements are normalized: the real part of the SRCD must be divided by the length (L) or width (W) of the rectangle, and the imaginary part must be divided by R, where R is the distance between the centroid of the shape and the extreme point on the shape boundary.
When two shapes of the same class have different orientations, their SRCD feature vectors have different directions, too. By taking the discrete Fourier transform of the SRCD signature, it becomes invariant to rotation.

3.2 PROPOSED METHODOLOGY
The proposed methodology is developed using the local binary pattern and a Gaussian filter for image enhancement; figure 3.1 shows the flow diagram of the proposed system. The query image and the image collection are pre-processed, features are extracted, the query features are matched against the feature database, and the most similar images are retrieved.

Fig 3.1 Proposed CBIR Model

3.3 SHAPE FEATURES
Searching for images using shape features has attracted much attention. Shape representation and description is a difficult task, because when a 3-D real world object is projected onto a 2-D image plane, one dimension of the object information is lost; the shape extracted from the image therefore only partially represents the projected object. To make the problem even more complex, shape is often corrupted with noise, defects, arbitrary distortion and occlusion.
An important step before shape extraction is edge point detection. Shape is an important feature for recognition of objects in an image. There are two classes of techniques in shape based retrieval: region based techniques and boundary based techniques. A region based technique uses the whole shape region, whereas a boundary based technique only uses the boundary points of shapes in feature vector extraction. Region based techniques often involve intensive computations and fail to distinguish between objects that are similar; boundary based techniques are thus more efficient. Several techniques have been presented that are based on the boundary of shapes.

3.4 LOCAL BINARY PATTERN (LBP)
The Local Binary Pattern operator was introduced for texture classification. Given a centre pixel in the image, the LBP value is computed by comparing its gray value with those of its neighbors. The LBP method provides a robust way of describing pure local binary patterns in a texture. The original 3x3 neighborhood is thresholded by the value of the center pixel; the thresholded neighborhood pixel values are multiplied by the binomial weights of the corresponding pixels, and the resulting values are summed to give the LBP number of this texture unit. The LBP method is gray scale invariant and can be easily combined with a simple contrast measure by computing, for each neighborhood, the difference of the average gray levels of those pixels which have the value 1 and those which have the value 0, respectively, as shown in Figure 3.2.
neighborhood the difference of the average gray level of those


Fig.3.2 Local Binary pattern

Fig.3.3 Calculating LBP Operator

4. ALGORITHM
The proposed algorithm for extracting the feature vector for efficient image retrieval is presented as follows.

Algorithm I
Input: Input image.
Output: Feature set generation.
// Generation of feature vectors //
Step 1: Select an input image Q of size m x n from the image database.
Step 2: Rescale the image to 128x128 pixels and apply a Gaussian filter for image enhancement.
Step 3: Apply the Local Binary Pattern operator for texture description.
Step 4: Generate the feature vector of the input image Q.
Step 5: Repeat Step 1 through Step 4 for all images in the IDB.
Step 6: Stop.

Algorithm II
Input: Target image.
Output: Top n relevant images.
// Top n images retrieval //
Step 1: Select a target image Ti from the image database.
Step 2: Apply Step 1 through Step 5 of Algorithm I.
Step 3: Compute the Euclidean distance between the target image and each image in the IDB using the equation below.
Step 4: List the top n relevant images.
Step 5: Stop.

5. Similarity and Performance Measures
In the texture based image retrieval system, the Euclidean distance is used to find the distance between the feature vectors of the target image It and each image Ii in the image database. The difference between two images Ii and It can be expressed as the distance d between the respective feature vectors Fs(Ii) and Fs(It). For a given input image Ii and target image It, the Euclidean distance dE(Fs(Ii), Fs(It)) is calculated as

    dE(Fs(Ii), Fs(It)) = sqrt( sum for k = 1..n of ( Fs_k(Ii) - Fs_k(It) )^2 )

where Fs(Ii) is the feature set of the input image Ii and Fs(It) is the n-dimensional feature vector of the target image It.

TABLE 1: Performance Comparison for Shape Based Image Retrieval

            Existing Color & Shape (Histogram)    Proposed Shape
Category    Precision    Recall                   Precision    Recall
Beach       0.45         0.11                     0.875        0.18
Building    0.61         0.14                     0.8          0.13
Car         0.33         0.08                     0.75         0.07
Dinosaur    0.45         0.1                      0.51         0.06
Horse       0.38         0.08                     0.74         0.09
Rose        0.44         0.1                      0.83         0.2
AVG         0.44         0.1                      0.75         0.121
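Algorithm II amounts to a nearest-neighbour ranking under the Euclidean distance defined above. A compact sketch follows, assuming the feature vectors (e.g., LBP histograms from Algorithm I) are already computed.

```python
import numpy as np

def top_n_similar(query_feat: np.ndarray, db_feats: np.ndarray, n: int = 10):
    """Return indices of the n database images whose feature vectors are
    closest to the query under the Euclidean distance d_E."""
    d = np.sqrt(((db_feats - query_feat) ** 2).sum(axis=1))
    return np.argsort(d)[:n]

# usage: ranked = top_n_similar(fs_query, fs_database, n=10)
# where fs_database has one feature vector per row.
```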


6. PRECISION
Precision measures the fraction of retrieved images that are relevant to a specific query, and is analogous to positive predictive value. To estimate the precision, we randomly selected 250 concepts among those that appeared in the collection. For each concept, we selected a random sample of up to five figure captions in which MMTx identified the concept as present. Those captions were reviewed manually to determine whether the caption was indexed by the specified concept correctly (true positive [TP]) or incorrectly (false positive [FP]). We computed the precision as the number of captions correctly indexed (TP) divided by the total number of captions indexed (TP + FP). We calculated the 95% confidence interval for precision based on the size of the sample.

7. RECALL
Recall measures the fraction of all the relevant documents in a collection that are retrieved by a specific query, and is akin to the concept of sensitivity. Here, recall is the number of figure captions that were indexed by a concept divided by the number of captions in which the concept was actually present. We estimated recall by sampling concepts and captions. We randomly selected 40 concepts, each of which MMTx had indexed in more than 10 figure captions. For each concept, the true positive (TP) value was estimated as the total number of positive captions (those indexed by that concept) multiplied by the overall precision value.
Then, for each concept, we sampled 25 figure captions from among those that were not indexed by that concept and reviewed those concept-caption pairs. Those captions should be negative; the sample TN is the number of true negative (TN) figures among the 25 sampled for each concept. Based on the sample TN value, we extrapolated to the entire set of negative captions. Recall was computed as the number of correctly indexed captions (TP) divided by the number of captions that truly contained the concept (TP + FN), where the false negatives (FN) are the captions that contained the concept but were not indexed by it.

Fig 4.6 Proposed Method and Existing Method In Terms Of Precision

8. CONCLUSION
A novel method for image retrieval using the local binary pattern has been presented. The existing methods have various drawbacks, and the proposed method overcomes these earlier issues. The feature vector is generated with the local binary pattern from the monogram image. The similarity measure is based on the Euclidean distance: the similarity of the monogram image and the layer image is obtained by computing the distance between them. The performance of the image retrieval is evaluated with precision and recall. The proposed method achieves precision values of 0.875, 0.8, 0.75, 0.51, 0.74 and 0.83, with recall values of 0.18, 0.13, 0.07, 0.06, 0.09 and 0.2, and compares favourably with the existing local binary pattern method.

9. REFERENCES
[ABB00] Abby A. Goodrum, "Image Information Retrieval: An Overview of Current Research", International Journal of Imaging Science and Engineering (IJISE), 2000.
[ANJ14] Anju Maria, Dhanya S., "Amalgamation of Contour, Texture, Color, Edge, and Spatial Features for Efficacious Image Retrieval", IJRET: International Journal of Research in Engineering and Technology, Feb 2014.
[ARI07] S. Arivazhagan, L. Ganesan, S. Selvanidhyananthan, "Image Retrieval using Shape Feature", International Journal of Imaging Science and Engineering (IJISE), July 2007.
[CHE13] Cheng Chang, Liu Wenyin, Hongjiang Zhang, "Image retrieval based on region shape similarity", Lecture Notes on Software Engineering, November 2013.
[GUL11] Gulfishan Firdose Ahmed, Raju Barskar, "A Study on Different Image Retrieval Techniques in Image Processing", International Journal of Soft Computing and Engineering (IJSCE), September 2011.
[KON05] K. Konstantinidis, A. Gasteratos, I. Andreadis, "Image Retrieval Based on Fuzzy Color Histogram Processing", Optics Communications, 2005.
[MEH12] Mehwish Rehman, Muhammad Iqbal, Muhammad Sharif and Mudassar Raza, "Content Based Image Retrieval: Survey", International Journal of Advanced Computer Research, 2012.
[MSO95] M. Stricker and M. Orengo, "Similarity of color images", Proc. SPIE, Storage and Retrieval for Image and Video Databases, (1995) 381-392.
[NIK13] Niket Amoda and Ramesh K. Kulkarni, "Efficient Image Retrieval Using Region Based Image Retrieval", Signal & Image Processing: An International Journal (SIPIJ), June 2013.
[PRA03] B.G. Prasad, K.K. Biswas, and S.K. Gupta, "Region-Based Image Retrieval Using Integrated Color, Shape and Location Index", Indian Institute of Technology, October 2003.
[RIS14] Risvana Fathima, M. Kaja Mohaideen, S. Sakkaravarthi, "Content Based Image Retrieval Algorithm Using Local Tetra Texture Features", International Journal of Advanced Computational Engineering and Networking, Jan 2014.
[SHA09] Sharadh Ramaswamy and Kenneth Rose, "Towards Optimal Indexing for Relevance Feedback in Large Image Databases", IEEE Transactions on Image Processing, December 2009.
[SON13] Sonya Eini and Abdolah Chalechale, "Shape Retrieval Using Smallest Rectangle Centroid Distance", International Journal of Signal Processing, Image Processing and Pattern Recognition, August 2013.
[YLI07] Ying Liu, Dengsheng Zhang, Guojun Lu, Wei-Ying Ma, "A survey of content-based image retrieval with high-level semantics", Elsevier J. Pattern Recognition, 40, 262-282, 2007.
[YON99] Yong Rui and Thomas S. Huang, "Image Retrieval: Current Techniques, Promising Directions, and Open Issues", Journal of Visual Communication and Image Representation, 1999.


An Efficient Ad Hoc Network for Theft Prevention under Honest Node Detection

M. Parameswari (Associate Professor & Ph.D Scholar), Dr. S. Sukumaran (Associate Professor)
Department of Computer Science, Erode Arts and Science College (Autonomous), Erode.
parameswari678@gmail.com, Prof_sukumar@yahoo.co.in

ABSTRACT

In vehicular networks, moving vehicles are enabled to communicate with each other via inter-vehicle communications, as well as with road-side units (RSUs) in the vicinity via roadside-to-vehicle communications. In urban vehicular networks, where privacy, and especially the location privacy of vehicles, must be guaranteed, vehicles need to be verified in an anonymous manner. This paper proposes a novel Sybil attack detection mechanism, Footprint, which uses the trajectories of vehicles for identification while still preserving their location privacy. A location-hidden authorized message generation scheme is designed with two objectives: first, RSU signatures on messages are signer-ambiguous, so that the RSU location information is concealed from the resulting authorized message; second, two authorized messages signed by the same RSU within the same given period of time (temporarily linkable) are recognizable, so that they can be used for identification.

1. INTRODUCTION
In recent times, many projects have initiated research to investigate ad-hoc networks as a communication technology for vehicle-specific applications, within the wider field of intelligent transportation systems (ITS). ITS are computerized systems that have diverse applications and are connected with vehicle transportation. The computerized systems are made up of computers, communications, sensor and control technologies, and management strategies [1].

1.1 Vehicle Adhoc Network
Vehicular Ad-Hoc Networks (VANETs) make it possible for vehicles to broadcast warnings about environmental hazards (e.g., ice on the pavement), traffic and road conditions (e.g., emergency braking, congestion, or construction sites), and local (e.g., tourist) information to other vehicles [3]. Once it is known that there is a traffic jam, road closure or accident ahead, a driver may safely avoid the route and save time. Communication between vehicles is therefore suitable because vehicles are able to distribute warnings to other vehicles. Messages can be sent from vehicles to summon help if needed and to inform authorities of dangerous behaviour on roads. In another paper, by Papadimitratos et al. [5], VANETs are said to help make roads safer and offer convenience.
However, if the vehicle's permanent identity is hidden from all other vehicles and can only be seen by the authorities, then only authorized personnel will know which vehicle has requested help and will be able to assist accordingly. This ensures that malicious users will not know which vehicle has requested help, for instance for a fuel shortage, hence protecting the vehicle and its occupants. According to Aboudagga et al. [12] and Sehgal et al. [13], it is also necessary for the vehicle to be registered with some central authority so that nodes caught sending erroneous and malicious messages can be identified and held accountable. VANETs are an application of Mobile Ad-Hoc Networks (MANETs). Sun [15] defines a MANET as a collection of wireless nodes that can be dynamically set up anywhere and anytime without using any pre-existing network infrastructure.

Fig 1.1 VANET System Architecture

1.2 Characteristics
There are some special characteristics present in VANETs which are exceptions to general MANETs. These are listed below:
- VANETs are potentially large-scale networks [1].
- Road configuration, traffic laws and speed limits on roads affect the mobility of vehicles [1]. Simulating vehicle traffic is thus a complex task and is a focus of study for applications in transportation engineering [1].
- Vehicles are able to provide more resources than the typical mobile devices used in MANETs; conserving these resources in VANETs is therefore not a major concern.


1.3 Vehicle Adhoc Network Architectures

VANETs are made up of vehicles (which are equipped with on-board units, OBUs) and road-side infrastructure units (RSUs). The architecture of a VANET is split into three domains: the In-Vehicle Domain, the Ad-Hoc Domain and the Infrastructure Domain. The In-Vehicle Domain consists of an OBU and many Application Units (AUs). The AUs are user devices, for example mobile phones and PDAs, that perform certain functions when interacting with the OBU. The Ad-Hoc Domain consists of the OBUs in vehicles and the RSUs along the roadside.

1.4 Vehicle Adhoc Network Applications
Papadimitratos et al. [5] maintain that VANETs are developed as a means to enhance transportation safety and efficiency. Safety applications are just one of the applications for VANETs; the main application areas are:
- Safety Applications
- Automated Highways
- IP Based Applications
- High Mobility
- Location Awareness
- Security

2. PROBLEM DESCRIPTION
Detecting Sybil attacks in urban vehicular networks, however, is very challenging. First, vehicles are anonymous: there are no chains of trust linking claimed identities to real vehicles. Second, the location privacy of vehicles is of great concern, since location information of vehicles can be very confidential.

2.1 Objectives
2.1.1 Main Objectives
- To detect multiple adversaries using the same identity
- To localize the hacker node
- Cluster-based victim node detection in the network
- To detect the presence of spoofing attacks
- To determine the number of attackers
- To localize multiple adversaries and eliminate them
- To create a mobile node network
- To make mobile movement (random walk) within a given speed
- To update location information to a node's neighbors
- To update the location information of all nodes to the base station
- To simulate a replicated node attack
- To make the base station identify the mobile replication attack

2.1.2 Specific Objectives
- A fast and effective mobile replica node detection scheme is proposed using the Sequential Probability Ratio Test.
- To tackle the problem of spoofing attacks in mobile sensor networks.
- The scheme detects mobile replicas in an efficient and robust manner at the cost of reasonable overheads.
- To determine the number of attackers when multiple adversaries masquerade as the same node identity.

The architecture of the system model consists of three interactive components:
1. Road Side Unit (RSU)
2. On-board Units (OBUs)
3. Trust Authority (TA)

Road Side Unit (RSU): The RSU can be deployed at intersections or any area of interest (e.g., bus stations and parking lot entrances). A typical RSU also functions as a wireless access point (e.g., IEEE 802.11x) which provides wireless access to users within its coverage.

On-board Units (OBUs): The OBUs are installed on vehicles. A typical OBU can be equipped with a cheap GPS receiver and a short-range wireless communication module (e.g., DSRC IEEE 802.11p [18]).

Trust Authority (TA): The TA is responsible for system initialization and RSU management. The TA is also connected to the RSU backbone network.

Deployment of Road Side Units (RSUs): In Footprint, vehicles require authorized messages issued from RSUs to form trajectories, so the RSUs should be statically installed as the infrastructure.

2.1.3 Drawbacks
The existing system has the following disadvantages:
- The failed-RSU scenario is not covered in the existing system.
- If an RSU fails during a given period, a trajectory created at that time will not contain that RSU's information along the trajectory.
- Consequently, another trajectory containing this RSU (prepared after the RSU is back online) will look distinct.
- Before comparing two trajectories for similarity, the trajectory information should therefore be modified so that the missed RSU is added to the trajectory.

2.2 Proposed System and Its Advantages
The proposed system covers the whole approach of the existing system. In addition, RSU failure is taken into consideration: a trajectory created during an event with a missed RSU is repaired before Sybil attack detection (a small sketch of this repair is given below), so misinterpretation of an honest vehicle as a suspicious node is avoided. The length of the messages grows linearly with the trajectory length.
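As a purely illustrative sketch of this repair, the Python function below inserts a known-failed RSU between its two neighboring RSUs, so that trajectories recorded while it was offline become comparable with later ones. The function name, the neighbor representation and the insertion rule are assumptions for illustration, not the paper's actual algorithm:

    def repair_trajectory(trajectory, failed_rsu, left_neighbor, right_neighbor):
        """Insert a failed RSU wherever the vehicle passed its two neighbors
        consecutively, so old and new trajectories align (illustrative only)."""
        repaired = []
        for i, rsu in enumerate(trajectory):
            repaired.append(rsu)
            if (rsu == left_neighbor and i + 1 < len(trajectory)
                    and trajectory[i + 1] == right_neighbor):
                repaired.append(failed_rsu)  # assume the vehicle also passed it
        return repaired

    # A trajectory recorded while R5 was offline, with R4 and R6 as its neighbors:
    print(repair_trajectory(["R1", "R4", "R6", "R7"], "R5", "R4", "R6"))
    # ['R1', 'R4', 'R5', 'R6', 'R7']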


Reducing the message size is not covered in the existing system, so the repeated occurrences of adjacent RSUs are eliminated in the proposed system. For example, a trajectory {R1, R4, R5, R6, R4, R5, R7} can be compressed to {R1, R4, R5, R6, R7}, reducing the trajectory length by two. To implement this, the proposed system covers infrastructure construction, generating a location-hidden trajectory (trajectory reconstruction with a missed RSU) and Sybil attack detection.
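One simple reading of this rule, keeping only the first occurrence of each RSU, reproduces the example above. The following Python sketch uses illustrative names, not the paper's code:

    def compress_trajectory(trajectory):
        """Drop repeated RSU occurrences, keeping first-visit order."""
        seen = set()
        compressed = []
        for rsu in trajectory:
            if rsu not in seen:
                seen.add(rsu)
                compressed.append(rsu)
        return compressed

    trajectory = ["R1", "R4", "R5", "R6", "R4", "R5", "R7"]
    print(compress_trajectory(trajectory))  # ['R1', 'R4', 'R5', 'R6', 'R7']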
2.3 Advantages
The proposed system has the following advantages:
- Honest vehicles are differentiated from suspicious nodes more easily than in the existing system.
- The length of the trajectory information is reduced without loss of information.

3. MODULE DESCRIPTION
The following modules are present in the thesis:
1. SHOW NETWORK
2. ADD ROAD SIDE UNIT
3. UPDATE ROAD SIDE UNIT FAILURE
4. ADD NEIGHBOR ROAD SIDE UNIT
5. VIEW ROAD SIDE UNITS
6. VIEW NEIGHBOR ROAD SIDE UNITS
7. MESSAGE
8. FIND SYBIL ATTACK
9. COMMUNICATION

1. SHOW NETWORK: In this module, a typical vehicular network (with RSUs installed) is shown graphically.

2. ADD ROAD SIDE UNIT: In this module, road side unit details are added and saved to the RSU table. Sybil attacks are detected using the proposed approach.

3. UPDATE ROAD SIDE UNIT FAILURE: In this module, details of failures occurring in a road side unit are updated and saved to the RSU table. The RSU's Active status is set to 0, and the status is announced to all other RSUs by the Trusted Agent.

4. ADD NEIGHBOR ROAD SIDE UNIT: In this module, the road side unit id, neighbor road side unit id and distance details are added and saved to the NeighborRSUs table. This distance information assists the Sybil attack detection.

5. VIEW ROAD SIDE UNITS: In this module, road side unit details are fetched from the RSU table. The records are displayed using a data grid view control.

6. VIEW NEIGHBOR ROAD SIDE UNITS: In this module, road side unit and neighbor RSU details are fetched from the NeighborRSUs table. The records are displayed using a data grid view control.

7. MESSAGE: The message module is used to update the messages exchanged between a road side unit and a vehicle's on-board unit.

8. FIND SYBIL ATTACK: The find Sybil attack module is used to detect unauthorized vehicles in the network.

9. COMMUNICATION: In partial signature creation, the input is provided as a key pair, namely the private key of the road side unit and the public key of the vehicle; the message is then encrypted and the partial signature value is computed in the application.

Table: SYBIL ATTACK DETECTION (dissimilarity values, existing vs. proposed system)

Sender RSU   Received RSU   Existing    Proposed
1            2              1.170895    0.422507
2            3              1.470895    0.50258
3            4              1.850895    0.672581
4            5              2.070895    0.69258
5            6              2.525395    0.70258
6            7              1.170895    0.758281

4. RESULT AND DISCUSSION
In this section, we examine the performance of Footprint in recognizing forged trajectories (issued by malicious vehicles) and actual ones (provided by honest vehicles) through trace-driven simulations. We consider two key metrics.

4.1. Key Metrics of Footprint
In the Sybil attack detection scheme, it is possible that a trajectory of an honest vehicle could be mixed with other trajectories (either malicious Sybil trajectories or other honest ones), especially when the trajectory is short; this causes false alarms of Sybil trajectories. The issue can be largely mitigated by comparing multiple sets of trajectories issued in different events, which lowers the probability of an honest trajectory of a vehicle being treated as Sybil in any one event.
1. False positive error: the proportion of all actual trajectories that are incorrectly identified as forged trajectories.
2. False negative error: the proportion of all forged trajectories that are falsely identified as actual trajectories.
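A toy evaluation of these two metrics might look as follows (illustrative Python; the label strings and data are invented):

    def error_rates(truth, predicted):
        """False positive error: share of actual trajectories flagged as forged.
        False negative error: share of forged trajectories passed as actual."""
        fp = sum(t == "actual" and p == "forged" for t, p in zip(truth, predicted))
        fn = sum(t == "forged" and p == "actual" for t, p in zip(truth, predicted))
        n_actual = max(truth.count("actual"), 1)
        n_forged = max(truth.count("forged"), 1)
        return fp / n_actual, fn / n_forged

    truth     = ["actual", "actual", "forged", "forged"]
    predicted = ["actual", "forged", "forged", "actual"]
    print(error_rates(truth, predicted))  # (0.5, 0.5)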
5. CONCLUSION
A Sybil attack detection scheme named Footprint is developed for urban vehicular networks. Consecutive authorized messages obtained by an anonymous vehicle from RSUs form a trajectory that identifies the corresponding vehicle. The location privacy of vehicles is preserved by realizing a location-hidden signature scheme. Utilizing the social relationship among trajectories, Footprint can find and eliminate Sybil trajectories. The Footprint design can be incrementally


implemented in a large city. It is also demonstrated by both analysis and extensive trace-driven simulations that Footprint can largely restrict Sybil attacks and can enormously reduce their impact in urban settings (above a 98 percent detection rate).
The proposed detection mechanism leaves much room for extension, and future work will continue in several directions. First, Footprint assumes that all RSUs are trustworthy. However, if an RSU is compromised, it can help a malicious vehicle generate fake legal trajectories (e.g., by inserting link tags of other RSUs into a forged trajectory).

6. REFERENCES
1. Maen M. Artimy, William Robertson, and William J. Phillips, Algorithms and Protocols for Wireless and Mobile Ad Hoc Networks, 1st ed., Azzedine Boukerche, Ed., John Wiley & Sons, 2009.
2. Sudipto Das, "Security Issues in MANETs," www.cs.ucsb.edu/~sudipto/talks/Security.pps.
3. P. Papadimitratos, G. Calandriello, J.-P. Hubaux, and A. Lioy, "Impact of Vehicular Communications Security on Transportation Safety," in IEEE INFOCOM Workshop, Lausanne, 2008, pp. 1-6.
4. A. Boukerche, Algorithms and Protocols for Wireless, Mobile Ad Hoc Networks, 1st ed., Wiley-IEEE Press, 2009.
5. VANET: Vehicular Applications and Inter-Networking Technologies, 1st ed., Hannes Hartenstein and Kenneth P. Laberteaux, Eds., West Sussex, United Kingdom: John Wiley and Sons Ltd, 2010.
6. Yunpeng Wang, Zhenguo Yi, Daxin Tian, and Haiying Xia, "Safety Message Transmitting Method for Vehicle Infrastructure Integration," in 6th Advanced Forum on Transportation of China, Beijing, 2010.
7. Umesh Sehgal, Kuljeet Kaur, and Pawan Kumar, "Security in Vehicular Ad-hoc Networks," in Second International Conference on Computer and Electrical Engineering, 2009, pp. 485-488.
8. I. A. Sumra, H. Hasbullah, I. Ahmad, and J.-L. bin Ab Manan, "Forming Vehicular Web of Trust in VANET," in Saudi International Electronics, Communications and Photonics Conference, 2011.


A Study on Various Approaches handled for Clustering

R.Pushpalatha
Research Scholar in Computer Science, Erode Arts and Science College, Erode, and
Assistant Professor, Department of Computer Science, Kongu Arts and Science College, Erode.
Email Id.: rplphd14@gmail.com

Dr.K.Meenakshi Sundaram
Associate Professor, Department of Computer Science, Erode Arts and Science College, Erode.
Email Id.: lecturerkms@yahoo.com

ABSTRACT

The ambition of the data mining process is to extract patterns and information from a large data set and transform them into an understandable form for further use. The steps involved in data mining are data integration, data selection, data cleaning, data transformation, data mining, pattern evaluation, knowledge presentation, and decisions / use of discovered knowledge. Data mining has four main stages: assemble data, apply data mining tools on datasets, interpret and evaluate the results, and apply the results. Clustering is one of the most important concepts in data analysis and data mining applications. Clustering groups a set of objects so that objects in the same group are more similar to each other than to those in other groups (clusters). Different types of clusters are well-separated clusters, center-based clusters, contiguous clusters, density-based clusters, and shared property or conceptual clusters. The main tasks of data mining are the predictive task and the descriptive task. Different algorithms, such as hierarchical, partitioning, grid-based and density-based algorithms, are used to perform clustering. Hierarchical clustering is an example of connectivity-based clustering, while partitioning clustering is centroid-based. Density-based clusters are defined as areas of higher density than the remainder of the data set. Grid-based clustering has the fastest processing time, which typically depends on the size of the grid instead of the data; the grid-based methods use a single uniform grid mesh to partition the entire problem domain into cells. In this survey paper, a review of clustering and its different techniques in data mining is done.

Keywords
Clustering, Types of Clustering, Task of Data Mining, Classification of Clustering, Clustering methods.

1. INTRODUCTION
The process of analyzing data from different perspectives and summarizing it into useful information is known as data mining. The main functionalities of data mining are characterization and discrimination, the mining of frequent patterns, associations and correlations, classification and regression, clustering analysis, and outlier analysis [2]. Clustering is one of the most interesting and important topics in data mining. The main aim of clustering is to find intrinsic structures in the data and organize them into meaningful subgroups for further study and analysis. Clustering is a data mining tool with roots in many application areas such as biology, image pattern recognition, security, business intelligence and Web search. Cluster analysis [2] can be used as a standalone data mining tool to achieve data distribution, or as a pre-processing step for other data mining algorithms operating on the detected clusters. Data mining includes anomaly detection, association rule learning, classification, regression, summarization and clustering. In this paper, clustering analysis is considered; cluster analysis is an automatic process to find similar objects in a database.

The stages of clustering are shown in Figure 1.

[Figure 1 shows the stages of clustering: Raw Data, fed to a Clustering Algorithm, producing Clusters of Data.]
Figure 1: Stages of Clustering

A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. A good clustering algorithm identifies clusters irrespective of their shapes. Other requirements for clustering algorithms are scalability, the ability to deal with noisy data, insensitivity to the order of input records, etc. Any cluster should have two main properties: low inter-class similarity and high intra-class similarity. That is, intra-cluster distances are minimized and inter-cluster distances are maximized, as shown in Figure 2 and illustrated in the sketch below.

[Figure 2 illustrates that intra-cluster distances are minimized while inter-cluster distances are maximized.]
Figure 2: Intra and Inter Cluster Distances
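As a toy illustration of these two properties, the following Python sketch (invented data and function names) computes an average intra-cluster distance and an inter-cluster centroid distance:

    import itertools
    import math

    def intra_cluster_distance(cluster):
        """Average pairwise distance inside one cluster (to be minimized)."""
        pairs = list(itertools.combinations(cluster, 2))
        return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

    def inter_cluster_distance(c1, c2):
        """Distance between two cluster centroids (to be maximized)."""
        centroid = lambda c: tuple(sum(dim) / len(c) for dim in zip(*c))
        return math.dist(centroid(c1), centroid(c2))

    a = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1)]
    b = [(6.0, 6.0), (6.3, 5.9)]
    print(intra_cluster_distance(a), inter_cluster_distance(a, b))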

Data mining is a multi-step process. It requires accessing and preparing data for a data mining algorithm, mining the data, analyzing the results and taking appropriate action. The accessed data can be stored in one or more operational databases, a data


warehouse or a flat file. In data mining, the data is mined using two learning approaches, i.e., supervised learning and unsupervised clustering.

1.1 Supervised Learning
When the class label of each training tuple is provided, the learning is known as supervised learning, also called directed data mining; the training data are analyzed by a classification algorithm. The two groups of variables are explanatory variables and dependent variables. The target of the analysis is to specify a relationship between the explanatory variables and the dependent variable, as is done in regression analysis. To apply directed data mining techniques, the values of the dependent variable must be known for a sufficiently large part of the data set. The training data include both the input and the desired results. These methods are fast and accurate. The correct results are known and are given as inputs to the model during the learning process. Supervised models include neural networks, the Multilayer Perceptron and decision trees.

1.2 Unsupervised Learning
When the class label of each training tuple is not known, and the number or set of classes to be learned may not be known in advance, the learning is known as unsupervised learning. In unsupervised learning all variables are treated in the same way; there is no distinction between explanatory and dependent variables. The model is not provided with the correct results during the training. It can be used to cluster the input data into classes on the basis of their statistical properties only.

From a machine learning perspective, clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective, clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Presenting data by fewer clusters necessarily loses certain fine details (loss in data compression), but achieves simplification: many data objects are represented by few clusters, and hence the data is modeled by its clusters. Clustering is often one of the first steps in data mining analysis. It identifies groups of related records that can be used as a starting point for exploring further relationships. Clustering is a data mining (machine learning) technique used to place data elements into related groups without advance knowledge of the group definitions. Clustering techniques fall into the group of undirected data mining tools; the goal of undirected data mining is to discover structure in the data as a whole.

In general, there are two types of attributes associated with input data in clustering algorithms: numerical attributes and categorical attributes. Numerical attributes are those with a finite or infinite number of ordered values, such as the height of a person or the x-coordinate of a point on a 2D domain. Categorical attributes are those with finite unordered values, such as the occupation or the blood type of a person. Many different clustering techniques have been defined in order to solve the problem from different perspectives, i.e., partition-based clustering, density-based clustering, hierarchical methods and grid-based methods.

2. GENERAL TYPES OF CLUSTERS

2.1. Well-separated clusters
A cluster is a set of points such that any point in a cluster is nearer (or more similar) to every other point in the cluster than to any point that is not in the cluster.

2.2. Center-based clusters
A cluster is a set of objects such that an object in a cluster is nearer (more similar) to the center of its own cluster than to the center of any other cluster. The center of a cluster is often a centroid.

2.3. Contiguous clusters
A cluster is a set of points such that a point in a cluster is nearer (or more similar) to one or more other points in the cluster than to any point that is not in the cluster.

2.4. Density-based clusters
A cluster is a dense region of points, which is separated from other high-density regions by low-density regions.

2.5. Shared Property or Conceptual Clusters
These are clusters that share some common property or represent a particular concept.

3. DATA MINING TASKS
Data mining tasks [3] are generally divided into two major categories:

3.1 Predictive task
The goal of this task is to predict the value of one particular attribute based on the values of other attributes. The attributes used for making the prediction are named independent variables; the value to be predicted is known as the target or dependent value.

3.2 Descriptive task
The purpose of this task is to surmise underlying relations in the data. In the descriptive task of data mining, values are independent in nature, and it frequently requires post-processing to validate the results.

4. CLASSIFICATION OF CLUSTERING
Clustering is a main task of data mining, and it is done by a number of algorithms. The most commonly used algorithms in clustering are hierarchical, partitioning, density-based and grid-based algorithms.

4.1. PARTITIONING ALGORITHMS
In the partitioning method [1], a partitioning algorithm arranges all the objects into various partitions, where the total number of partitions is less than the total number of objects. Partitioning algorithms divide data into several subsets. Because checking all possible subset systems is computationally infeasible, certain greedy heuristic schemes are used in the form of iterative optimization. Specifically, this means different relocation schemes that iteratively reassign points between the k clusters. Relocation algorithms gradually improve clusters.


[Figure 3 shows an example of partitioned clustering.]
Figure 3: Partitioned Clustering

The two main methods of partitioning clustering are the k-means method and the k-medoids method. In k-means algorithms, each cluster is represented by the center of gravity (centroid) of the cluster, i.e., the mean of the points within the cluster; this works efficiently only with numerical attributes and can be negatively affected by a single outlier. The k-means algorithm is the most popular clustering tool used in scientific and industrial applications: it is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean [8]. In k-medoid algorithms, each cluster is represented by one of the objects of the cluster located near the center.

The basic k-means algorithm is very simple (a short sketch follows the steps):
1. Arbitrarily choose k objects from D as the initial cluster centers.
Repeat:
2. (Re)assign each object to the cluster to which the object is most similar, based on the mean value of the objects in the cluster.
3. Update the cluster means, that is, calculate the mean value of the objects for each cluster.
Until no change.
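The following Python sketch (illustrative only; the function name and toy data are invented) implements the three steps above for points represented as equal-length numeric tuples:

    import random

    def k_means(points, k, max_iter=100):
        """Minimal k-means sketch following the three steps above."""
        centers = random.sample(points, k)      # step 1: arbitrary initial centers
        for _ in range(max_iter):
            clusters = [[] for _ in range(k)]
            for p in points:                    # step 2: assign to nearest mean
                idx = min(range(k), key=lambda i: sum(
                    (a - b) ** 2 for a, b in zip(p, centers[i])))
                clusters[idx].append(p)
            new_centers = [                     # step 3: recompute cluster means
                tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[i]
                for i, cl in enumerate(clusters)
            ]
            if new_centers == centers:          # until no change
                break
            centers = new_centers
        return centers, clusters

    centers, clusters = k_means([(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.2, 7.9)], k=2)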
The k-means algorithm has the following important properties:
- It is efficient in processing large data sets.
- It often terminates at a local optimum.
- It works only on numeric values.
- The clusters have convex shapes.

In the k-medoids algorithm, a cluster is represented by one of the objects of the cluster located near its center. Here the partitioning is based on the sum of the dissimilarities between each object and its reference point (medoid): n observations are partitioned into k clusters in which each observation belongs to the cluster with the nearest medoid.
The k-medoids algorithm works in the following steps (a sketch of the swap loop follows the list):
1. Randomly select k objects that represent reference points as medoids.
2. Assign each remaining object to the cluster that is most similar to its medoid.
3. Randomly select a non-medoid object.
4. Calculate the cost function that is used to determine the dissimilarity between the medoid and the non-medoid object.
5. Swap the medoid with the non-medoid object to make new medoids.
6. If the total cost of the swap is negative, the process is restarted from step 2.
7. Repeat this process until no swapping occurs.

PAM (Partitioning Around Medoids) uses the k-medoid method to identify clusters and is suitable for small data sets. CLARA (Clustering LARge Applications) determines k-medoids through an iterative optimization.
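A toy version of the cost function of step 4 and the swap loop of steps 5-7 could look as follows (illustrative Python, not an optimized PAM implementation; `dist` is any dissimilarity function supplied by the caller):

    import itertools

    def total_cost(points, medoids, dist):
        """Step 4's cost: each object's dissimilarity to its nearest medoid."""
        return sum(min(dist(p, m) for m in medoids) for p in points)

    def pam_sketch(points, k, dist):
        """Tiny sketch of the medoid swap loop in steps 5-7."""
        medoids = list(points[:k])              # step 1: pick k reference medoids
        best = total_cost(points, medoids, dist)
        improved = True
        while improved:                         # step 7: stop when no swap helps
            improved = False
            for i, p in itertools.product(range(k), points):
                if p in medoids:
                    continue
                candidate = medoids[:i] + [p] + medoids[i + 1:]
                cost = total_cost(points, candidate, dist)
                if cost < best:                 # steps 5-6: keep cost-reducing swaps
                    medoids, best, improved = candidate, cost, True
        return medoids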


4.2. HIERARCHICAL ALGORITHMS
Hierarchical clustering produces a set of nested clusters organized as a hierarchical tree, as shown in Figure 4.

[Figure 4 shows a tree (dendrogram) of nested clusters.]
Figure 4: Hierarchical Clustering

Hierarchical clustering groups all the objects into a tree of clusters arranged in a hierarchical manner. Hierarchical methods work on bottom-up or top-down approaches; they are connectivity-based clustering algorithms and build clusters gradually. Hierarchical clustering generally falls into two types. In hierarchical clustering the data are not partitioned into a particular cluster in a single step; a series of partitions is taken, which may run from a single cluster containing all objects to n clusters each containing a single object. Hierarchical clustering is subdivided into the agglomerative method, in which single clusters are merged into larger clusters and merging continues until all the single clusters are merged into one big cluster that contains all the objects (a bottom-up approach), and the divisive method, in which all the objects start in one big single cluster and the large cluster is continuously divided into smaller clusters until each cluster has a single object.
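As a small illustration of the bottom-up (agglomerative) approach, the sketch below uses SciPy's hierarchical clustering routines on invented toy data (assuming SciPy is available; this is not tied to any specific algorithm surveyed here):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Toy 2-D data: two tight groups and one distant point.
    X = np.array([[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [5.1, 4.8], [9.0, 9.1]])
    Z = linkage(X, method="single")          # merge nearest clusters step by step
    labels = fcluster(Z, t=2.0, criterion="distance")  # cut the tree at distance 2
    print(labels)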
4.2.1 Advantages of Hierarchical Clustering
- Embedded flexibility regarding the level of granularity.
- Ease of handling any form of similarity or distance.
- Applicability to any attribute type.

4.2.2 Disadvantages of Hierarchical Clustering
- Difficulty regarding the selection of merge or split points.
- Once clusters are merged or divided into smaller clusters, the generated clusters cannot be changed, and objects cannot be swapped between clusters.
- Most hierarchical algorithms do not revisit once-constructed clusters with the purpose of improvement.

In the hierarchical method, clustering is done on a hierarchical decomposition of the data. The major drawback of this method is that once a step is done it can never be undone. Multiphase clustering algorithms such as BIRCH and Chameleon are used to overcome this issue.

4.3 MULTIPHASE CLUSTERING
4.3.1. BIRCH
BIRCH [3] (Balanced Iterative Reducing and Clustering Using Hierarchies) is designed for clustering a large amount of numerical data by integrating hierarchical clustering (at the initial micro-clustering stage) and other clustering methods such as iterative partitioning (at the later macro-clustering stage). It overcomes the two difficulties of agglomerative clustering methods, namely scalability and the inability to undo what was done in the previous step. BIRCH introduces two concepts, the clustering feature (CF) and the clustering feature tree (CF tree), which together represent all objects in clusters as a hierarchical representation. These structures help the clustering method achieve good speed and scalability in large databases and also make it effective for incremental and dynamic clustering of incoming objects. BIRCH examines the data objects two or three times to improve the quality of the built CF tree. The drawback of BIRCH is that it is suitable only for spherical clusters.

4.3.2. Chameleon Method
Chameleon [9] is a hierarchical clustering algorithm that uses dynamic modelling to determine the similarity between pairs of clusters. In Chameleon, cluster similarity is assessed based on how well-connected objects are within a cluster and on the proximity of clusters; that is, two clusters are merged if their interconnectivity is high and they are close together. Thus, Chameleon does not depend on a static, user-supplied model and can automatically adapt to the internal characteristics of the clusters being merged. The merge process facilitates the discovery of natural and homogeneous clusters and applies to all types of data as long as a similarity function can be specified.

4.4. DENSITY-BASED CLUSTERING
Density-based algorithms continue to grow a given cluster as long as the density in the neighborhood exceeds a certain threshold [6]. This kind of algorithm is suitable for handling noise in the dataset. The following points are enumerated as the features of this kind of algorithm:
- Handles clusters of arbitrary shape.
- Handles noise.
- Needs only one scan of the input dataset.
- Needs density parameters to be initialized.
DBSCAN, DENCLUE and OPTICS [6] are examples of this kind of algorithm.
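As an illustration of the density-based family, the sketch below runs scikit-learn's DBSCAN on invented toy data (assuming scikit-learn is available); the isolated point is labelled -1, i.e., noise:

    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
                  [8.0, 8.0], [8.1, 7.9],
                  [50.0, 50.0]])             # an isolated point becomes noise
    db = DBSCAN(eps=0.5, min_samples=2).fit(X)
    print(db.labels_)                        # noise points are labelled -1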
4.4.1. DESCRY
This paper describes a method named DESCRY (a Density-Based Clustering Algorithm for Very Large Data Sets) to identify clusters in large, high-dimensional data sets having different sizes and shapes. The algorithm's parameters are the agglomerative method used in the pre-clustering step and the similarity metric of interest. DESCRY has a very low computational complexity: it requires O(Nmd) time for high-dimensional data sets and O(N log m) time for low-dimensional data sets, where m can be considered a constant characteristic of the data set. Thus DESCRY scales linearly in both the size and the dimensionality of the data set [9]. Despite its low complexity, the qualitative results are very good and comparable with those obtained by state-of-the-art clustering algorithms. Future work includes, among other topics, the investigation of similarity metrics particularly meaningful in high-dimensional spaces, exploiting summaries extracted from the regions associated with midpoints.

4.4.2. GDBSCAN and its Applications
This paper presents the clustering algorithm GDBSCAN (Density-Based Clustering in Spatial Databases), generalizing the density-based algorithm DBSCAN in two important ways. GDBSCAN can cluster point objects as well as spatially extended objects according to both their spatial and their non-spatial attributes. After a review of related work, the general concept of density-connected sets and an algorithm to discover them were introduced [10]. A performance evaluation, analytical as well as experimental, showed the effectiveness and efficiency of GDBSCAN on large spatial databases.

4.5. GRID BASED ALGORITHMS
4.5.1. STING
STING [11] (STatistical INformation Grid) is a grid-based multi-resolution clustering technique in which the spatial area is divided into rectangular cells. There are usually several levels of such rectangular cells corresponding to different levels of resolution, and these cells form a hierarchical structure: each cell at a high level is partitioned to form a number of cells at the next lower level. Statistical information regarding the attributes in each grid cell (such as the mean, maximum, and minimum values) is pre-computed and stored. Statistical parameters of higher-level cells can easily be computed from the parameters of the lower-level cells. These parameters include the following: the attribute-independent parameter, count; the attribute-dependent parameters mean, stdev (standard deviation), min (minimum) and max (maximum); and the type of distribution of the attribute value in the cell, such as normal, uniform, exponential, or none (if the distribution is unknown). When the data are loaded into the database, the parameters count, mean, stdev, min and max of the bottom-level cells are calculated directly from the data. The value of distribution may either be assigned by the user, if the distribution type is known beforehand, or obtained by hypothesis tests such as the chi-square test. The type of distribution of a higher-level cell can be computed based on the majority of distribution types of its corresponding lower-level cells, in conjunction with a threshold filtering process. If the distributions of the lower-level cells disagree with each other and fail the threshold test, the distribution type of the high-level cell is set to none.

STING offers several advantages:


- The grid-based computation is query-independent, because the statistical information stored in each cell represents summary information of the data in the grid cell, independent of the query.
- The grid structure facilitates parallel processing and incremental updating.
- The method's efficiency is a major advantage: STING goes through the database once to compute the statistical parameters of the cells.

The time complexity of generating clusters is O(n), where n is the total number of objects. After generating the hierarchical structure, the query processing time is O(g), where g is the total number of grid cells at the lowest level, which is usually much smaller than n. Because STING uses a multi-resolution approach to cluster analysis, the quality of STING clustering depends on the granularity of the lowest level of the grid structure. If the granularity is very fine, the cost of processing will increase substantially; however, if the bottom level of the grid structure is too coarse, it may reduce the quality of cluster analysis. As a result, the shapes of the resulting clusters are isothetic; that is, all of the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected. This may lower the quality and accuracy of the clusters despite the fast processing time of the technique.
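The bottom-up computation of cell parameters can be illustrated with a short sketch. Assuming a single numeric attribute and population standard deviations (the function names are invented), a parent cell's count, mean, stdev, min and max can be derived from its child cells without rescanning the data:

    import math

    def cell_stats(values):
        """Bottom-level cell parameters computed directly from the data."""
        n = len(values)
        mean = sum(values) / n
        var = sum((v - mean) ** 2 for v in values) / n
        return {"count": n, "mean": mean, "stdev": math.sqrt(var),
                "min": min(values), "max": max(values)}

    def merge_cells(children):
        """Parent-cell parameters derived from child cells, not raw data."""
        n = sum(c["count"] for c in children)
        mean = sum(c["count"] * c["mean"] for c in children) / n
        # Recover E[x^2] of the parent from each child's mean and stdev.
        ex2 = sum(c["count"] * (c["stdev"] ** 2 + c["mean"] ** 2)
                  for c in children) / n
        return {"count": n, "mean": mean, "stdev": math.sqrt(ex2 - mean ** 2),
                "min": min(c["min"] for c in children),
                "max": max(c["max"] for c in children)}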
of Engineering Science and Technology, pp.3441-
4.5.2. CLIQUE
CLIQUE [11] (CLustering In QUEst) was the first algorithm proposed for dimension-growth subspace clustering in high-dimensional space. In dimension-growth subspace clustering, the clustering process starts at single-dimensional subspaces and grows upward to higher-dimensional ones. Because CLIQUE partitions each dimension like a grid structure and determines whether a cell is dense based on the number of points it contains, it can also be viewed as an integration of density-based and grid-based clustering methods. However, its overall approach is typical of subspace clustering for high-dimensional space, and so it is introduced in this section.

The ideas of the CLIQUE clustering algorithm are outlined as follows:
- Given a large set of multidimensional data points, the data space is usually not uniformly occupied by the data points. CLIQUE's clustering identifies the sparse and the crowded areas in space (or units), thereby discovering the overall distribution patterns of the data set.
- A unit is dense if the fraction of total data points contained in it exceeds an input model parameter. In CLIQUE, a cluster is defined as a maximal set of connected dense units.

CLIQUE performs multidimensional clustering in two steps. In the first step, CLIQUE partitions the d-dimensional data space into non-overlapping rectangular units, identifying the dense units among these. This is done one dimension at a time, for each dimension.
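A toy sketch of this first step might look as follows (illustrative Python; the unit-counting scheme is an assumption based on the description above):

    from collections import Counter

    def dense_units(points, n_units, threshold):
        """Split each dimension into n_units equal-width intervals and keep
        the (dimension, unit) pairs whose share of points exceeds threshold."""
        n = len(points)
        dense = []
        for d in range(len(points[0])):
            vals = [p[d] for p in points]
            lo = min(vals)
            width = (max(vals) - lo) / n_units or 1.0
            counts = Counter(min(int((v - lo) / width), n_units - 1) for v in vals)
            dense.extend((d, u) for u, c in counts.items() if c / n > threshold)
        return dense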
5. CONCLUSION
The overall goal of the data mining process is to extract information from a large data set and transform it into an understandable form for further use. Clustering is important in data analysis and data mining applications. It is the task of grouping a set of objects so that objects in the same group are more similar to each other than to those in other groups (clusters). Clustering can be done by a number of different algorithms, such as hierarchical, partitioning, grid-based and density-based algorithms. Hierarchical clustering is connectivity-based clustering. Partitioning is centroid-based clustering, where the value of k (as in k-means) is set. Density-based clusters are defined as areas of higher density than the remainder of the data set. Grid-based clustering has the fastest processing time, which typically depends on the size of the grid instead of the data; the grid-based methods use a single uniform grid mesh to partition the entire problem domain into cells.

6. REFERENCES
1. Anitha Elavarasi. S, Dr. Akilandeswari. J, Dr. Sathiyabhama. B, "A Survey on Partition Clustering Algorithms," January 2011.
2. Arockiam. L. S, Baskar. S, Jeyasimman. L, "Clustering Techniques in Data Mining," 2012.
3. Han. J, Kamber. M, Data Mining: Concepts and Techniques, 3rd edition, pp. 443-491, 2012.
4. Ilango. K, Dr. Mohan. V, "A Survey of Grid Based Clustering Algorithms," International Journal of Engineering Science and Technology, pp. 3441-3446, 2010.
5. Gholamreza Esfandani, Mohsen Sayyadi, Amin Namadchian, "GDCLU: a new Grid-Density based CLUstering algorithm," IEEE 13th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing, pp. 102-107, 2012.
6. Manish Verma, Mauly Srivastava, Neha Chack, Nidhi Gupta, "A Comparative Study of Various Clustering Algorithms in Data Mining," International Journal of Engineering Research and Applications (IJERA), Vol. 2, Issue 3, pp. 1379-1384, 2012.
7. Pragati Shrivastava, Hitesh Gupta, "A Review of Density-Based Clustering in Spatial Data," International Journal of Advanced Computer Research, ISSN (print) 2249-7277, September 2012.
8. Ritu Sharma, Afshar Alam, Anita Rani, "K-Means Clustering in Spatial Data Mining using Weka Interface," International Conference on Advances in Communication and Computing Technologies (ICACACT), proceedings published by the International Journal of Computer Applications (IJCA), pp. 26-30, 2012.
9. Umadevi Chezhian, Thanappan Subhash, M. Raghvan, "Hierarchical Sequence Clustering Algorithm for Data Mining," Proceedings of the World Congress on Engineering 2011, Volume III, WCE 2011, July 6-8, London, U.K., 2011.
10. Vijayalakshmi. M, Renuka Devi. M, "A Survey of Different Issues of Different Clustering Algorithms Used in Large Data Sets," International Journal of Advanced Research in Computer Science and Software Engineering, pp. 305-307, 2012.


11. Wei Wang, Jiong Yang, and Richard Muntz, "STING: A Statistical Information Grid Approach to Spatial Data Mining," Department of Computer Science, University of California, Los Angeles.


A Study on Various Classification Techniques for Detecting Liver Cancer in Human

R.Rajamani
Assistant Professor, Department of Computer Science,
PSG College of Arts & Science, Coimbatore.
rajamani_devadoss@yahoo.co.in

M.RATHIKA
Research Scholar, Department of Computer Science,
PSG College of Arts & Science, Coimbatore.
rathi2109@gmail.com

ABSTRACT

Cancer is a generic term for a large group of diseases that can affect any part of the body. Today, liver cancer is a leading cause of cancer deaths worldwide, accounting for more than 600,000 deaths each year. Detection of liver cancer in its early stage is the key to its cure. The diagnosis of liver cancer in its early stage mainly relies on X-ray chest films, CT, MRI, etc. To detect liver cancer, the related datasets in hospitals and laboratories can be processed into useful information; for this reason, various data mining techniques can be applied in this field. Data mining is the process of examining data from different perspectives and summarizing it into useful information. In this paper, we discuss various data mining classification methods such as rule set classifiers, decision trees and the K-Nearest Neighbor algorithm. This paper also focuses on Neural Networks and the Support Vector Machine (SVM) for detecting liver cancer in a better way.

Keywords
Liver cancer, Data mining, Rule set Classifiers, Decision trees, Neural Networks, K-Nearest Neighbor, Support Vector Machine.

1. INTRODUCTION
Cancer is one of the most widespread diseases leading to fatal death. One defining feature of cancer is the rapid creation of abnormal cells that grow beyond their usual boundaries, which can then invade adjoining parts of the body and spread to other organs. The average age at diagnosis of liver cancer is 63. More than 95% of people diagnosed with liver cancer are 45 years of age or older; about 3% are between 35 and 44 years of age and about 2% are younger than 35. Liver cancer is much more common in countries in sub-Saharan Africa and Southeast Asia than in the US; in many of these countries it is the most common type of cancer. More than 700,000 people are diagnosed with this cancer each year throughout the world.
2. LITERATURE REVIEW
To identify liver cancer, Huilin Xiong and Xue-Wen Chen deal with the kernel-function approach, which improves the performance of the classifier on genetic data. The efficiency of the kernel approach has been probed, and it depends on optimizing a data-dependent kernel model. The K-nearest-neighbor (KNN) and support vector machine (SVM) could be used as classifiers for performance analysis. The data sets utilized are the ALL-AML Leukemia data, Breast-ER, Breast-LN, Colon Tumor data, Lung Cancer data and Prostate Cancer data from microarray data. Kernel optimization schemes have been discovered to classify gene expression data. The performance is evaluated when applying the optimized kernel in classifying gene expression data; the optimized kernel provides better accuracy.

The Naive Bayes algorithm was tested on liver cancer disease datasets by P. Rajeswari and G. Sophia Reena; the time taken to run the data for a result is short compared to other algorithms, and it shows enhanced performance according to its attributes. Attributes are fully classified by this algorithm, and it gives a 97.10% accurate result. The data mining method used to build the model is classification. The data analysis is processed using the WEKA data mining tool for exploratory data analysis, machine learning and statistical learning algorithms. The training data set consists of 345 instances with 7 different attributes. The instances in the dataset represent the results of different types of tests to predict the accuracy of liver disease. The performance of the classifiers is evaluated and their results are analyzed.

R. Mallika and V. Saravanan defined a novel method for cancer classification using the expressions of very few genes. This method uses the same classifier for both selection and classification, and it used three datasets, namely the Lymphoma, Liver and Leukemia datasets from microarray gene expression data. The classifiers Support Vector Machine one-against-all (SVM-OAA), K-nearest neighbour (KNN) and Linear Discriminant Analysis (LDA) were compared with one another. In this research, two different classification models have been built using two different classifiers, ANN and SVM, and their performance has been examined in classifying the BUPA Liver Disorder dataset. Experimental results show that the SVM classifier gives better performance than ANN for liver cancer classification in terms of accuracy, specificity and AUC value (63.11%, 100.00% and 68.34%, respectively). This work indicates that SVM can be effectively used to help medical experts diagnose liver cancer.

N. Revathy and R. Amalraj defined a new method to process microarray data for liver cancer classification. Several methods are available to rank gene expression data, such as the T-score and ANOVA, but these are not suitable for large data sets; to rectify this problem, the authors proposed the enrichment score. The classifier used here is the Support Vector Machine (SVM). The data set is randomly divided into two parts, one for training while the remainder is for testing, and the classifier is trained with the training data. The lymphoma data set is used for performance demonstration. Top genes can be selected from the ranked data and passed into the classifier one by one; if no good accuracy is attained, gene combinations can be formed from the ranked data set. Again the combinations of genes can be classified


until good accuracy is achieved. The results can be evaluated using SVM with the T-score and SVM with the enrichment score; the performance accuracy and classification time can be compared with one another. SVM with the enrichment score performed well, with higher accuracy.

Huang, Y. L. et al. suggested that SVM be used to construct a liver disease diagnosis system for testing whether hepatic tumors are benign or malignant. They evaluated 164 liver lesions; out of 164, 80 are malignant and 84 belong to hemangioma. The malignant group of tumors includes primary HCC and metastatic tumors. The results show that a total accuracy rate of 81.7% is obtained. A multi-class SVM classifier is also proposed, based on statistical learning theory, for automatic classification in liver disease.
3. CAUSES OF LIVER CANCER
3.1 The Organ: Liver
The liver is in the upper right part of the abdomen. It is the largest organ in our body and it has many functions, which include:
- Storing glycogen (fuel for the body), which is made from sugars. When required, glycogen is broken down into glucose, which is released into the bloodstream.
- Helping to process fats and proteins from digested food.
- Making proteins that are essential for blood to clot.
- Helping to process and/or remove alcohol, many types of medicines, toxins and poisons from the body.
3.2 Causes
The leading cause of liver cancer is viral infection with the hepatitis B virus or the hepatitis C virus. The cancer usually forms secondary to cirrhosis caused by these viruses. For this reason, the highest rates of liver cancer occur where these viruses are endemic, including East Asia and sub-Saharan Africa. Fig. 1 shows liver cancer cells in the liver.

[Figure 1 shows liver cancer cells in the liver.]
Fig 1: Liver Cancer Cells

Besides the main culprits, hepatitis B or C, alcohol is considered a major contributor to primary liver cancer. Diabetes too is a major risk factor, with diabetics being at least two and a half times more vulnerable to developing chronic liver disease (CLD) that can progress to liver cancer. Non-alcoholic fatty liver disease (NAFLD) is also an emerging concern. Liver cancers should not be confused with liver metastases, also known as secondary liver cancer, which is a cancer that originates from organs elsewhere in the body and migrates to the liver. Liver cancers are formed from either the liver itself or from structures within the liver, including blood vessels or the bile duct.

3.3 Symptoms
The following are the major symptoms of liver cancer:
- A lump below the rib cage on the right side of the abdomen
- Pain near the right shoulder or on the right side of the abdomen
- Jaundice (a disease that causes the skin to yellow)
- Unexplained weight loss
- Fatigue
- Nausea or vomiting
- Loss of appetite
- Dark-colored urine

3.4 Stages of Liver Cancer
Liver cancer stages include the following (Table 1):

Table 1. Liver Cancer Stages

Stage I     One tumor is found in the liver only.
Stage II    One tumor is found, but it has spread to the blood vessels, OR more than one tumor is present, but they are all smaller than 5 cm.
Stage III   There is more than one tumor larger than 5 cm.
Stage IV    The cancer has spread to other locations in the body.

3.5 How to Diagnose?
Early diagnosis can prevent health problems that may result from infection and prevent transmission of the virus. Knowledge about the causes of liver cancer, and interventions to prevent and manage the disease, is extensive. Liver cancer can be reduced and controlled by implementing evidence-based strategies for cancer prevention, early detection of cancer and management of patients with cancer.

4. KNOWLEDGE DISCOVERY AND DATA MINING

4.1 Knowledge Discovery
KDD is the process of turning low-level data into high-level knowledge. Hence, KDD refers to the non-trivial extraction of implicit, previously unknown and potentially useful information from data in databases. While data mining and KDD are often treated as equivalent terms, in reality data mining is an important step in the KDD process.
The Knowledge Discovery in Databases process comprises a few steps leading from raw data collections to some form of new knowledge. The iterative process consists of the following steps:
1) Data cleaning: also known as data cleansing, a phase in which noise data and irrelevant data are removed from the collection.
2) Data integration: at this stage, multiple data sources, often heterogeneous, may be combined in a common source.
3) Data selection: at this step, the data relevant to the analysis is decided on and retrieved from the data collected.
4) Data transformation: also known as data consolidation, a phase in which the selected data is transformed into forms appropriate for the mining procedure.


5) Data mining: the crucial step, in which clever techniques are applied to extract potentially useful patterns.
6) Pattern evaluation: in this step, interesting patterns representing knowledge are identified based on given measures.
7) Knowledge representation: the final phase, in which the discovered knowledge is visually represented to the user.

4.2 Data Mining Concept
Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. Data mining is the extraction of hidden predictive information from large databases. It is a powerful new technology with great potential to analyze important information in the data warehouse. The effect of data mining is extracting interesting knowledge from large databases.
In short, data mining refers to extracting hidden, previously unknown and potentially useful information from a database, and then offering understandable knowledge, such as association rules, cluster patterns, etc., so as to support users in decision-making. The process of data mining is shown in Figure 2.

[Figure 2 shows the data mining process as a pipeline: Raw Data, then Clean integrated data using filtration, then Preprocessing, then Data Mining, then Patterns, then Knowledge.]
Fig 2: Data Mining Process

5. DATA MINING CLASSIFICATION METHODS
Data mining consists of various methods. Different methods serve different purposes, each method offering its own advantages and disadvantages. In data mining, classification is one of the most important tasks. It maps the data into predefined targets; it is supervised learning, as the targets are predefined, using one or more attributes to describe an object or a group of objects. The classifier is then used to predict the group attributes of new cases from the domain based on the values of the other attributes. We now describe a few commonly used classification techniques with illustrations of their applications to liver cancer.

5.1 Rule Set Classifiers
Complex decision trees can be difficult to understand, for instance because information about one class is usually distributed throughout them. The aim of classification is to build a classifier, based on some cases with some attributes, either in the form of a tree or as rules. C4.5 introduced an alternative formalism consisting of a list of rules of the form "if A and B and C and ... then class X", where rules for each class are grouped together. A case is classified by finding the first rule whose conditions are satisfied by the case; if no rule is satisfied, the case is assigned to a default class.

IF conditions THEN conclusion: this kind of rule consists of two parts. The rule antecedent (the IF part) contains one or more conditions on the values of predictor attributes, whereas the rule consequent (the THEN part) contains a prediction about the value of a goal attribute. An accurate prediction of the value of a goal attribute will improve the decision-making process. IF-THEN prediction rules are very popular in data mining; they represent discovered knowledge at a high level of abstraction.

In the health care system it can be applied as follows: (Symptoms) (Previous history) (Cause of disease).
Example 1: an if-then rule induced in the diagnosis of liver cancer:
IF Sex = MALE AND EYE_COLOR = Yellow AND Blood_Test = alcohol_content_HIGH THEN Diagnosis = Liver_cancer_High_risk.
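A first-match rule-list classifier in this style can be sketched as follows (illustrative Python; the rule encoding and names are invented):

    def classify(case, rules, default="Low_risk"):
        """Return the conclusion of the first rule whose conditions all hold."""
        for conditions, conclusion in rules:
            if all(case.get(attr) == value for attr, value in conditions.items()):
                return conclusion
        return default  # no rule fired: assign the default class

    rules = [({"Sex": "MALE", "Eye_color": "Yellow",
               "Blood_test": "alcohol_content_HIGH"}, "Liver_cancer_High_risk")]
    patient = {"Sex": "MALE", "Eye_color": "Yellow",
               "Blood_test": "alcohol_content_HIGH"}
    print(classify(patient, rules))  # Liver_cancer_High_risk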

5.2 Decision Tree Algorithm
A decision tree is a classification scheme which generates a tree and a set of rules representing the model of the different classes from a given data set. Decision trees can also be interpreted as a special form of rule set, characterized by their hierarchical organization of rules. Decision tree models are well suited for data mining: they are inexpensive to construct, easy to interpret, easy to integrate with database systems, and they have comparable or better accuracy in many applications.

[Figure 3 shows a decision tree rooted at Symptoms with branches Less, Moderate and Heavy, inner nodes testing Age <= 45 or Age > 45, and leaves labelled Low, Medium and High risk.]
Fig 3: A Decision Tree for predicting the risk of Liver Cancer


To classify a particular data item, we start at the root node and
follow the assertions down until we reach a terminal node (or
leaf). The nodes and branches are organized in the form of a
tree such that every internal non-leaf node is labeled with an
attribute, and the branches coming out of an internal node are
labeled with the values of the attribute in that node. The
decision tree shown in Fig. 3 is built from the very small
training set in Table 2. In this table each row corresponds to a
patient record; we will refer to a row as a data instance. The
data set contains two predictor attributes, namely Severity of
symptoms and Age, and one goal attribute, namely risk of the
disease, whose value (to be predicted from the symptoms)
indicates whether the corresponding patient is at low, medium
or high risk.

Table 2. Data Set from Fig 3

S no | Severity of Symptoms | Age | Risk of the Disease (Liver Cancer)
  1  | Less                 | 62  | Medium
  2  | Moderate             | 54  | High
  3  | Less                 | 18  | Low
  4  | High                 | 43  | High
  5  | Moderate             | 26  | Medium

A decision tree can be used to classify a data instance of
unknown class with the help of the data set given in Table 2.
The idea is to push the instance down the tree, following the
branches whose attribute values match the instance's attribute
values, until the instance reaches a leaf node, whose class
label is then assigned to the instance.

For example, suppose the data instance to be classified is
described by the record (Age = 62, Severity of symptoms =
less, Goal = ?), where ? denotes the unknown value of the
goal attribute. In this example, the tree first tests the severity-
of-symptoms value of the instance. Since the answer is less,
the instance is pushed down through the corresponding branch
and reaches the Age node. The tree then tests the Age value of
the instance. Since the answer is 62, the instance is again
pushed down through the corresponding branch and reaches
the terminal node where the risk of liver cancer is medium.
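This push-down procedure is easy to express directly in code. The following is a minimal illustrative sketch (ours, not the authors' implementation): it hard-codes one tree that is consistent with the Fig 3 branch labels (less/moderate/heavy and an assumed Age <= 45 split read off the figure) and checks it against every row of Table 2.

    # A sketch of the Fig 3 tree as nested rules; the Age <= 45 threshold
    # is an assumption taken from the figure labels.
    def classify_risk(severity: str, age: int) -> str:
        """Push an instance down the tree and return the leaf label."""
        severity = severity.lower()
        if severity == "less":
            return "Low" if age <= 45 else "Medium"
        if severity == "moderate":
            return "Medium" if age <= 45 else "High"
        return "High"   # heavy/high severity of symptoms

    # Table 2 instances: (Severity, Age, Risk)
    table2 = [("Less", 62, "Medium"), ("Moderate", 54, "High"),
              ("Less", 18, "Low"), ("High", 43, "High"),
              ("Moderate", 26, "Medium")]
    for sev, age, risk in table2:
        assert classify_risk(sev, age) == risk   # the tree fits the training set

    print(classify_risk("less", 62))   # the worked example above: Medium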
5.3 Neural Network Architecture
The architecture of the neural network consists of three
layers, namely an input layer, a hidden layer and an output
layer, as represented in Fig 4. The nodes in the input layer are
linked with a number of nodes in the hidden layer; each input
node is joined to each node in the hidden layer. The nodes in
the hidden layer may connect to nodes in another hidden
layer, or to an output layer. The output layer consists of one or
more response variables. A main concern of the training phase
is the interior weights of the neural network, which are
adjusted according to the transactions used in the learning
process; learning therefore amounts to modifying these
interior weights.

Fig 4: Architecture of Neural Network

The neural network has several advantages, including its
nonparametric nature, arbitrary decision boundary capability,
easy adaptation to different types of data and input structures,
fuzzy output values, and generalization for use with multiple
images. Neural networks are of particular interest because
they offer a means of efficiently modeling large and complex
problems in which there may be hundreds of predictor
variables that have many interactions. Neural nets may be
used in classification problems (where the output is a
categorical variable) or for regression (where the output
variable is continuous).
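As a toy illustration of this three-layer structure (ours, not from the study), the sketch below runs a single forward pass through random, untrained interior weights; the feature encoding and layer sizes are assumptions chosen only for the example.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Two inputs (severity coded numerically, age scaled), 4 hidden nodes,
    # 3 outputs (one per risk class Low/Medium/High); all assumed values.
    W_hidden = rng.normal(size=(2, 4))   # interior weights, input -> hidden
    W_output = rng.normal(size=(4, 3))   # interior weights, hidden -> output

    x = np.array([0.0, 62 / 100.0])      # e.g. severity = less, age = 62
    hidden = sigmoid(x @ W_hidden)       # every input node feeds every hidden node
    scores = sigmoid(hidden @ W_output)  # one response variable per class
    print(scores)                        # untrained, so the scores are arbitrary

Training would repeatedly adjust W_hidden and W_output against the learning transactions, which is exactly the weight modification described above.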
5.4 Bayesian Network Structure
A Bayesian Network (BN) is a relatively new tool that
identifies probabilistic correlations in order to make
predictions or assessments of class membership. A conditional
probability is the likelihood of some conclusion, C, given
some evidence/observation, E, where a dependence
relationship exists between C and E. This probability is
denoted as P(C | E), where

    P(C | E) = P(E | C) P(C) / P(E)

This conditional relationship allows an investigator to gain
probability information about either C or E given the known
outcome of the other.
5.5 Support Vector Machine
The support vector machine (SVM) is an algorithm that
attempts to find a linear separator between the data points of
two classes in multidimensional space. SVMs are well suited
to dealing with interactions among features and with
redundant features.
SVM is a technique for training classifiers that depends on
various kernel functions, such as polynomial functions, radial
basis functions, neural networks, etc. In support vector
machines, the classifier is created using a separating
hyperplane. SVM offers an excellent solution for problems
that cannot be linearly separated in the input space: the
problem is resolved by making a non-linear transformation of
the original input space into a high-dimensional feature space,
where an optimal separating hyperplane is found. A maximal
margin classifier with respect to the training data is obtained
when the separating planes are optimal.
The support vectors are the points which lie on the margin,
and the solution depends only on these data points; this is the
unique feature of the technique. A linear SVM can be
extended to a nonlinear SVM if the feature space uses a group
of nonlinear basis functions; the data points can then be
separated linearly in the (very high-dimensional) feature
space. The fundamental idea of SVM can be described as
follows:


Step 1: The inputs are formulated as feature vectors.
Step 2: Using the kernel function, these feature vectors are
mapped into a feature space.
Step 3: A division is computed in the feature space to separate
the classes of training vectors.
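The three steps can be sketched with scikit-learn's SVC (our choice of library; the paper does not prescribe one). The RBF kernel performs the implicit mapping into a high-dimensional feature space, fit() computes the separating hyperplane, and support_vectors_ exposes the margin points on which the solution depends. The feature encoding and labels below are illustrative assumptions.

    from sklearn.svm import SVC

    # Step 1: inputs as feature vectors, e.g. (severity code, age); 1 = high risk
    X = [[0, 62], [1, 54], [0, 18], [2, 43], [1, 26]]
    y = [0, 1, 0, 1, 0]

    clf = SVC(kernel="rbf", gamma="scale")   # Steps 2-3: kernel map + separator
    clf.fit(X, y)
    print(clf.support_vectors_)              # the solution depends only on these
    print(clf.predict([[0, 50]]))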
5.6 K-Nearest Neighbor Algorithm
The K-nearest neighbor algorithm (KNN) is one of the
supervised learning algorithms that has been used in many
applications in the fields of data mining, statistical pattern
recognition and many others. It classifies objects based on the
closest training examples in the feature space: an object is
assigned to the class held by the majority of its neighbors. K
is always a positive integer, and the neighbors are selected
from a set of objects for which the correct classification is
known. The K-nearest neighbors algorithm is as follows:
1. Determine the parameter K, i.e., the number of nearest
neighbors, beforehand.
2. Calculate the distance between the query instance and all
the training samples using any distance measure.
3. Sort the distances for all the training samples and determine
the nearest neighbors based on the K-th minimum distance.
4. Since KNN is supervised learning, get the categories of the
training data for the sorted values which fall under K.
5. The prediction is obtained by the majority vote of the
nearest neighbors.

To classify an unclassified vector X, the KNN algorithm ranks
the neighbors of X amongst a given set of N data points
(Xi, Ci), i = 1, 2, ..., N, and uses the class labels Cj
(j = 1, 2, ..., K) of the K most similar neighbors to predict the
class of the new vector X. The classes of these neighbors are
weighted using the similarity between X and each of its
neighbors, measured by the Euclidean distance metric. X is
then assigned the class label with the greatest number of votes
among the K nearest class labels.
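The five steps translate almost line for line into code. Below is a compact sketch (ours), using Table 2's records as toy training data with an assumed numeric encoding of severity (less = 0, moderate = 1, high = 2):

    from collections import Counter
    from math import dist   # Euclidean distance (Python 3.8+)

    def knn_predict(query, training, k=3):
        # training: list of (feature_vector, class_label) pairs
        # steps 2-3: distance to every sample, then sort and keep the K nearest
        neighbours = sorted(training, key=lambda s: dist(query, s[0]))[:k]
        # steps 4-5: collect the neighbours' labels and take the majority vote
        votes = Counter(label for _, label in neighbours)
        return votes.most_common(1)[0][0]

    train = [((0, 62), "Medium"), ((1, 54), "High"), ((0, 18), "Low"),
             ((2, 43), "High"), ((1, 26), "Medium")]
    print(knn_predict((1, 50), train, k=3))   # votes: High, High, Medium -> High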
6. CONCLUSION
Data mining plays a very important role in scientific research.
Using the enormous volumes of case recordings stored in
databases, liver cancer can be detected with data mining
classification techniques. This paper presented a review of
classification techniques for liver cancer: rule set classifiers,
decision trees, neural networks, K-nearest neighbor and
support vector machines (SVM) were discussed. Among these
techniques, SVM is the most often used classifier for liver
cancer and gives the highest classification accuracy. In the
future, we will try to increase the accuracy of detecting liver
cancer by incorporating additional parameters suggested by
doctors. Better classification techniques, different ANN
transfer functions and different SVM kernel functions can
also be used in future research to improve classifier
performance.

7. REFERENCES
[1] World Cancer Report 2014; de Martel C, Ferlay J,
Franceschi S, et al., "Global burden of cancers attributable to
infections in 2008: a review and synthetic analysis", The
Lancet Oncology, 2012; 13:607-615.
[2] Huilin Xiong and Xue-Wen Chen, "Optimized Kernel
Machines for Cancer Classification Using Gene Expression
Data", Proceedings of the 2005 IEEE Symposium on
Computational Intelligence in Bioinformatics and
Computational Biology, pp. 1-7, 2005.
[3] Wenjing Zhang, Donglai Ma, Wei Yao, "Medical
Diagnosis Data Mining Based on Improved Apriori
Algorithm", Journal of Networks, vol. 9, no. 5, May 2014.
[4] S. Dhamodharan, "Liver Disease Prediction Using
Bayesian Classification", An International Journal of
Advanced Computer Technology, ISSN 2320-0790.
[5] http://ehealth.eletsonline.com/2014/02/50000-people-in-
india-are-diagnosed-with-liver-cancer-each-year/#sthash.e3WMWccl
[6] Muhamad Hariz Muhamad Adnan, Wahidah Husain,
Nur'Aini Abdul Rashid, "Data Mining for Medical Systems: A
Review", Proc. of the International Conference on Advances
in Computer and Information Technology (ACIT), 2012.
[7] P. Rajeswari, G. Sophia Reena, "Human Liver Cancer
Classification using Microarray Gene Expression Data",
International Journal of Computer Applications (0975-8887),
Volume 34, No. 6, November 2011.
[8] http://www.cancer.org/cancer/livercancer/detailedguide/liver-cancer-signs-symptoms
[9] Gunasundari, Janairaman, "A Study of Textural Analysis
Methods for the Diagnosis of Liver Diseases from Abdominal
Computed Tomography", International Journal of Computer
Applications (0975-8887), Volume 74, No. 11, July 2013.
[10] V. Krishnaiah et al., (IJCSIT) International Journal of
Computer Science and Information Technologies, Vol. 4 (1),
2013, pp. 39-45.
[11] Huang Y.L., Chen J.H., Shen W.C., "Diagnosis of
Hepatic Tumors with Texture Analysis in Nonenhanced
Computed Tomography Images", Acad Radiol, 2006;
13:713-720.
[12] Ada et al., International Journal of Advanced Research in
Computer Science and Software Engineering, 3(3), March
2013, pp. 131-134.
[13] Arun K. Pujari, "Data Mining Techniques", University
Press, 2001.
[14] J. Han and M. Kamber, "Data Mining: Concepts and
Techniques", San Francisco: Morgan Kaufmann Publishers,
2001.
[15] Jiawei Han and Micheline Kamber, "Data Mining:
Concepts and Techniques", Second Edition.
[16] Shelly Gupta et al., "Data Mining Classification
Techniques Applied for Breast Cancer Diagnosis and
Prognosis", Indian Journal of Computer Science and
Engineering (IJCSE), Vol. 2, No. 2, Apr-May 2011.
[17] P. Rajeswari, G. Sophia Reena, "Analysis of Liver
Disorder Using Data Mining Algorithm", Global Journal of
Computer Science and Technology, Vol. 10, Issue 14
(Ver. 1.0), November 2010.
[18] R. Mallika and V. Saravanan, "An SVM Based
Classification Method for Cancer Data Using Minimum
Microarray Gene Expressions", World Academy of Science,
Engineering and Technology, 62, 2010.
[19] N. Revathy and R. Amalraj, "Accurate Cancer
Classification Using Expressions of Very Few Genes",
International Journal of Computer Applications, Vol. 14,
No. 4, 2011.


A Study on Ontological Engineering


Dr K. Meenakshisundaram, Associate Professor, Department of Computer Science, Erode Arts and Science College, Lecturerkms@yahoo.com
K. Ramya, Assistant Professor, Department of Computer Science, Erode Arts and Science College, ramyathangavelme@gmail.com
G. Vijaiprabhu, Ph.D Research Scholar, Department of Computer Science, Erode Arts and Science College, vijaimca@yahoo.com

ABSTRACT
Data mining plays an important role in many real-time
areas like education, business and government sectors. An
effective data mining process means discovering knowledge
from the hidden data residing in large volumes of databases
and data sets. An ontology is an abstract model which
represents a common model for a domain; it is an explicit
specification of a conceptualization. In this article, we present
the basic concepts of ontology and survey existing research
on data mining with ontology. An ontology, a formal explicit
description of concepts or classes in a domain of discourse, is
the most important part of the knowledge. Among the areas of
data mining, the problem of deriving knowledge from data
has received a great deal of attention. This paper describes the
topics of ontological engineering, the types of ontology, and
rules for ontology design.

Keywords
Data Mining, Ontology Types, Ontology Uses.
1. INTRODUCTION
Ontologies [3] are content theories about the classes of
individuals, the properties of individuals, and the relations
between individuals that are possible in a specified domain of
knowledge. They define the stipulations for describing our
knowledge about the domain. An ontology of a domain is
beneficial in establishing a common (controlled) terminology
for describing the domain of interest. This is important for
organizing and sharing knowledge about the domain and for
linking it with other domains.

Data mining plays an important role in business
organizations in order to make the business effective. One of
the major motivations for the study of machine learning,
much more interesting than philosophical problems, is the
explosion of data in modern society. Most organizations have
large databases that contain a wealth of potentially accessible
information, yet accessing the information in large databases
is very difficult. There is also much confusion between the
two terms data mining and knowledge discovery from data.
According to [2], the term KDD is employed to describe the
whole process of extraction of knowledge from data, where
knowledge defines relationships and patterns between data
elements, while the term data mining is used exclusively for
the discovery stage of the KDD process. The definition of the
term is the non-trivial extraction of implicit, previously
unknown and potentially useful knowledge from data. KDD is
a multidisciplinary process, drawing on database technology,
expert systems, machine learning, concept learning,
visualization and statistics, as shown in Fig. 1.

Fig 1: KDD Process

1.1 Motivation for Data Mining
Some of the more interesting reasons for data mining
are the following. Most organizations have built
infrastructural databases: they have gigabytes of data with
much hidden information which is not traced easily using
SQL queries. As the usage of networks grows, it becomes
increasingly easy to connect databases. Machine-learning
techniques have expanded enormously, and the client/server
revolution gives the individual knowledge worker access to
central information systems.

1.2 Machine Learning and Its Issues
A good introduction to machine learning is called
concept learning. There is a variety of different techniques to
enable computers to learn concepts. A very important quality
of learning algorithms is that they learn consistent and
complete definitions. A definition of a concept is complete if
it recognizes all the instances of the concept; a definition of a
concept is consistent if it does not classify any negative
examples as falling under the concept.

A very important element in this type of learning is
the language in which the hypothesis describing the concept
is expressed. The language could be a specialized computer
language like Prolog or Lisp. There are several issues relating
to this kind of definition. They are:

Classification accuracy
For example, take the concept Kangaroo. The
machine must observe the concept thoroughly; it is
important that the classification given by the
hypothesis is accurate.

Transparency
It is important that the hypothesis created by our
machine-learning program is transparent, that is,
human beings must be able to read it.


Statistical significance
A machine-learning program is successful if and
only if it does better than what we call the naive
prediction. This confronts us with the notion of the
statistical significance of the results of our learning
programs.

Information content
Information content is closely related to statistical
significance and transparency. A program that learns
a theory containing no information is not very
useful.
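The completeness and consistency requirements above can be stated in a few lines of code. The following is a tiny sketch (our formalisation, with an invented toy feature set for the Kangaroo example): a hypothesis is complete if it accepts every positive example and consistent if it rejects every negative one.

    def is_complete(hypothesis, positives):
        # complete: recognizes all instances of the concept
        return all(hypothesis(x) for x in positives)

    def is_consistent(hypothesis, negatives):
        # consistent: classifies no negative example as the concept
        return not any(hypothesis(x) for x in negatives)

    # Toy concept "Kangaroo": animals with a pouch that hop (assumed features)
    hypothesis = lambda animal: animal["pouch"] and animal["hops"]
    positives = [{"pouch": True, "hops": True}]
    negatives = [{"pouch": True, "hops": False},    # e.g. a koala
                 {"pouch": False, "hops": True}]    # e.g. a rabbit

    print(is_complete(hypothesis, positives),
          is_consistent(hypothesis, negatives))     # True True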
1.3 KDD Process
Knowledge discovery from databases [6] has as its
objective to make knowledge emerge from hidden patterns in
large amounts of data. It generally relies on data mining
techniques to identify potentially relevant hidden models. The
effectiveness of data mining depends on appropriate
preparation of the data and interpretation of the results, which
are difficult tasks when dealing with heterogeneous,
distributed data from multiple sources. The knowledge
discovery process has several steps. They are:

Data selection
Cleaning
Enrichment
Coding
Data mining
Reporting

The data mining step is performed on the transformed (coded)
data; in the classic example, the analysts obtain patterns of
their magazine sales, and by evaluating the patterns resulting
from data mining they gain knowledge about those sales
(a schematic code sketch of this step sequence is given at the
end of this subsection). Several levels of analysis are used in
the data mining step: query tools, statistical techniques,
visualization, OLAP tools, case-based learning, decision
trees, association rules, and neural networks.

Another research topic in data mining that was
identified as important in [8] is data mining for natural and
environmental problems. By developing a data mining
ontology we would be able to connect to natural and
environmental domains, where the degree of ontology
development is really high. In biology domains, ontologies
have been used for different purposes [7]: as controlled
vocabularies, for representing encyclopedic knowledge, as the
specification of an information model (MAGE-OM,
MAGE-ML, the MGED ontology [1]), for the specification of
a data interchange format, and for representing the semantics
of data for information integration.
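The sketch below (ours, purely schematic) renders the step sequence as a chain of placeholder functions; in a real system each stage carries substantial work, but the composition order is the point being illustrated.

    def select(source):   return [r for r in source if r is not None]   # data selection
    def clean(rows):      return [r.strip().lower() for r in rows]      # cleaning
    def enrich(rows):     return [(r, len(r)) for r in rows]            # enrichment
    def code(rows):       return [(r, n % 3) for r, n in rows]          # coding
    def mine(rows):       return {c for _, c in rows}                   # data mining
    def report(patterns): print("patterns:", patterns)                  # reporting

    report(mine(code(enrich(clean(select(["Sales ", None, "Magazine"]))))))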
1.4 Data Mining Characteristics
Cao and Zhang [2] proposed a methodology called
D3M that considers human knowledge and context
information on the problem at hand during data mining. The
D3M methodology has the following characteristics:

Context-based restriction.
Domain knowledge integration.
Cooperation between man and machine.
Depth mining.
Improved knowledge discovery.
Interactive result refinement process.
Support for interactive and parallel mining.

2. ONTOLOGICAL ENGINEERING
In recent years the development of ontologies
(explicit formal specifications of the terms in a domain and
the relations among them [4]) has been moving from the
realm of Artificial-Intelligence laboratories to the desktops of
domain experts. Ontologies have become common on the
World-Wide Web. The ontologies on the Web range from
large taxonomies categorizing Web sites (such as on Yahoo!)
to categorizations of products for sale and their features (such
as on Amazon.com).

The WWW Consortium (W3C) is developing the
Resource Description Framework (RDF), a language for
encoding knowledge on Web pages to make it understandable
to electronic agents searching for information. The Defense
Advanced Research Projects Agency (DARPA), in
conjunction with the W3C, is developing the DARPA Agent
Markup Language (DAML) by extending RDF with more
expressive constructs aimed at facilitating agent interaction on
the Web.

Many disciplines now develop standardized
ontologies that domain experts can use to share and annotate
information in their fields. Medicine, for example, has
produced large, standardized, structured vocabularies such as
SNOMED [5] and the semantic network of the Unified
Medical Language System. Broad general-purpose ontologies
are emerging as well; for example, the United Nations
Development Program and Dun & Bradstreet combined their
efforts to develop the UNSPSC ontology, which provides
terminology for products and services.

The Artificial-Intelligence literature contains many
definitions of ontology, and many of them contradict one
another. For the purposes of this paper, an ontology is a
formal explicit description of concepts in a domain of
discourse (classes, sometimes called concepts), of properties
of each concept describing various features and attributes of
the concept (slots, sometimes called roles or properties), and
of restrictions on slots (facets, sometimes called role
restrictions). An ontology together with a set of individual
instances of classes constitutes a knowledge base. In reality,
there is a fine line where the ontology ends and the
knowledge base begins.

2.1 Motivation for Ontology
Why would someone want to develop an ontology?
Some of the reasons are:


To share common understanding of the structure of
information among people or software agents;
To enable reuse of domain knowledge;
To make domain assumptions explicit;
To separate domain knowledge from operational
knowledge;
To analyze domain knowledge.
Government Sectors. An effective data mining process
2.2 Developing an Ontology
In practical terms, developing an ontology includes:

Defining classes in the ontology;
Arranging the classes in a taxonomic
(subclass-superclass) hierarchy;
Defining slots and describing allowed values for
these slots;
Filling in the values of slots for instances.

We can then create a knowledge base by defining individual
instances of these classes, filling in specific slot-value
information and additional slot restrictions. A small code
sketch of these steps follows.
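The sketch below (ours; a real ontology would use a language such as the RDF/DAML mentioned above rather than plain Python dictionaries, and all names here are invented) shows classes in a taxonomy, a slot with an allowed-value facet, and an instance whose slot values form the knowledge base.

    ontology = {
        "classes": {"Publication": None,             # class -> superclass
                    "Journal": "Publication"},       # taxonomic hierarchy
        "slots": {"domain": {"class": "Publication",
                             "allowed": {"medicine", "computer science"}}},
    }

    knowledge_base = [   # individual instances with their slot values
        {"instance": "IJCSIS", "class": "Journal", "domain": "computer science"},
    ]

    for ind in knowledge_base:   # check each slot value against its facet
        facet = ontology["slots"]["domain"]["allowed"]
        assert ind["domain"] in facet, "slot value violates the facet restriction"
    print("knowledge base is consistent with the ontology")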
2.3 Types of Ontology
Ontologies can be classified according to the degree
of conceptualization. The types are:

Top-level ontology
Domain ontology
Application ontology

A top-level ontology describes very general concepts which
are independent of a particular problem or area; it does not
depend on the domain knowledge where the ontology is
applied. A top-level ontology is relevant across domains and
includes vocabulary related to things, events, time, space, etc.

In domain ontologies the knowledge represented is
specific to a particular domain, such as forestry or fishery.
They provide vocabularies about the concepts in a domain
and their associations, or about the theories governing the
domain.

Application (or task) ontologies describe knowledge
pieces depending both on a particular domain and on a task;
therefore, they are related to problem-solving methods.
2.4 Rules for Ontology Design
There are some fundamental rules in ontology
design to which we will refer many times while developing an
ontology. These rules may seem rather inflexible, but they can
help to make design decisions in many cases. The three
fundamental rules are as follows.

Rule 1: There is no single correct approach to modeling a
domain; there are always workable alternatives. The best
solution almost always depends on the application that we
have in mind and the extensions that we anticipate.

Rule 2: Ontology development is an iterative process.

Rule 3: Concepts in the ontology should be close to the
substances (physical or logical objects) and associations in
our domain of concentration. These are most likely to be
nouns (substances) or verbs (associations) in sentences that
describe our domain.

That is, deciding what we are going to use the
ontology for, and how detailed or general the ontology is
going to be, will guide many of the modeling decisions.

3. CONCLUSION
Data mining plays an important role in many real-
time areas like education, business and government sectors.
An effective data mining process means discovering
knowledge from the hidden data residing in large volumes of
databases and data sets. An ontology is an abstract model
which represents a common model for a domain; it is an
explicit specification of a conceptualization.

In this article, we presented the basic concepts of
ontology and surveyed existing research on data mining with
ontology. An ontology, a formal explicit description of
concepts or classes in a domain of discourse, is the most
important part of the knowledge. Among the areas of data
mining, the problem of deriving knowledge from data has
received a great deal of attention. This paper described the
topics of ontological engineering, the types of ontology and
the rules for ontology design.

4. REFERENCES
[1]. Ball C.A. and Brazma A., "MGED standards: work in
progress", Omics: A Journal of Integrative Biology,
10(2):138-144, 2006.
[2]. Cao L. and Zhang C., "Domain-driven data mining: A
practical methodology", IJDWM, vol. 2, no. 4,
pp. 49-65, 2006.
[3]. Chandrasekaran B., Josephson J.R. and Benjamins J.R.,
"What are ontologies, and why do we need them?",
IEEE Intelligent Systems, 14(1):20-26, 1999.
[4]. Cromp R.T. and Campbell W.J., "Data Mining of
Multi-dimensional Remotely Sensed Images", in Proc.
of Int. Conference on Information and Knowledge
Management, 1993.
[5]. Faure D. and Poibeau T., "First Experiences of Using
Semantic Knowledge Learned by ASIUM for
Information Extraction Task Using INTEX", Proc.
Workshop on Ontology Learning, 2000.
[6]. Heath T. and Bizer C., "Linked Data: Evolving the Web
into a Global Data Space" (1st edition), Synthesis
Lectures on the Semantic Web: Theory and Technology,
1:1, 1-136, Morgan & Claypool, 2011.
[7]. Smith B. and Shah N., "Ontologies for biomedicine -
how to make them and use them", tutorial notes at
ISMB/ECCB 2007.
[8]. Yang Q. and Wu X., "10 challenging problems in data
mining research", International Journal of Information
Technology and Decision Making, 5(4):597-604, 2006.

