
A SEMINAR/ COLLOQUIUM

REPORT
ON
OFFLINE HANDWRITTEN HINDI CHARACTER RECOGNITION USING DATA
MINING

Submitted in Partial Fulfillment of the Requirements for the


Degree of Master in Computer Applications

SUBMITTED BY
SHRIKRISHNA SHARMA
Roll No.: 1350204016

UNDER THE SUPERVISION OF


Mr. Ashok Kumar
Associate Professor,
Invertis University Bareilly

INVERTIS INSTITUTE OF COMPUTER APPLICATIONS


INVERTIS UNIVERSITY
Invertis Village, Lucknow National Highway 24, Bareilly, Uttar Pradesh 243123

Batch: 2012 - 2015

CERTIFICATE
This is to certify that Mr. SHRIKRISHNA SHARMA (Roll No. 1350204016) has
carried out the Seminar/Colloquium work presented in this report, entitled
OFFLINE HANDWRITTEN HINDI CHARACTER RECOGNITION USING DATA MINING, for the
award of Master of Computer Applications from Invertis University, Bareilly,
under the supervision of the undersigned.

Mr. Ashok Kumar
Seminar/Colloquium Supervisor
Associate Professor
Invertis University Bareilly

Mr. Ajay Indian
HOD (IICA)
Associate Professor
Invertis University Bareilly

ACKNOWLEDGEMENT
The real spirit of achieving a goal lies in the way of excellence and austere
discipline. I would never have succeeded in completing my task without the
cooperation, encouragement and help provided to me by various people.

First of all, I render my gratitude to the Almighty, who bestowed
self-confidence, ability and strength in me to complete this work. Without His
grace this would never have become today's reality. With a deep sense of
gratitude I express my sincere thanks to my esteemed and worthy supervisor,
Mr. Ashok Kumar of the Department of Master of Computer Applications, for his
valuable guidance in carrying out this work under his effective supervision,
encouragement, enlightenment and cooperation. Most of the novel ideas and
solutions found in this report are the result of our numerous stimulating
discussions. His feedback and editorial comments were also invaluable in the
writing of this report.

I would be failing in my duties if I did not express my deep sense of
gratitude towards Mr. Ajay Indian, Head of the Computer Applications
Department, who has been a constant source of inspiration for me throughout
this work.

I am also thankful to all the staff members of the Department for their full
cooperation and help.

THANK YOU

SHRIKRISHNA SHARMA
1350204016

ABSTRACT

Handwritten numeral recognition plays a vital role in postal automation
services, especially in countries like India where multiple languages and
scripts are used. The discrete Hidden Markov Model (HMM) and hybrids of a
Neural Network (NN) with an HMM are popular methods in handwritten word
recognition systems. The hybrid system gives better recognition results due to
the better discrimination capability of the NN.

A major problem in handwriting recognition is the huge variability and
distortion of patterns. Elastic models based on local observations and dynamic
programming, such as HMMs, are not efficient at absorbing this variability:
their view is local, they cannot cope with length variability, and they are
very sensitive to distortions. A Support Vector Machine (SVM), an alternative
to the NN, is therefore used to estimate global correlations and classify the
pattern; in handwriting recognition, the SVM gives better recognition results.
The aim of this paper is to develop an approach which improves the efficiency
of handwriting recognition using an artificial neural network.
Keywords: handwriting recognition, Support Vector Machine, Neural Network.

Advancement in Artificial Intelligence has led to the development of various
smart devices. The biggest challenge in the field of image processing is to
recognize documents in both printed and handwritten format. Character
recognition is one of the most widely used biometric traits for authentication
of both persons and documents.
Optical Character Recognition (OCR) is a type of document image analysis in
which a scanned digital image containing either machine-printed or handwritten
script is input to an OCR engine and translated into an editable,
machine-readable digital text format. A neural network is designed to model
the way in which the brain performs a particular task or function of interest.
Each character image is comprised of 3020 pixels. We have applied a feature
extraction technique to calculate the features; the features extracted from
characters are the directions of pixels with respect to their neighboring
pixels. These inputs are given to a back-propagation neural network with a
hidden layer and an output layer. We have used the back-propagation neural
network for efficient recognition, where the errors are corrected through back
propagation and the rectified neuron values are transmitted by the
feed-forward method through the multiple layers of the network.

Handwriting recognition is the ability of a computer to receive and interpret
intelligible handwritten input from sources such as photographs, touch
screens, paper documents and other devices. A written text image may be sensed
"off-line" from a piece of paper by optical scanning (optical character
recognition).
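The "direction of pixels with respect to their neighbouring pixels" feature mentioned above is not specified precisely in the report; one plausible sketch, using gradient orientation as a stand-in, is shown below (the report's own implementation is in MATLAB; this is an illustrative Python version, not the actual code):

```python
import numpy as np

def direction_features(img, n_bins=8):
    """Histogram of gradient directions over an image, normalized to sum 1.

    Hypothetical sketch: gradient orientation binned into 8 directions,
    weighted by gradient magnitude, as an example of a directional feature
    vector that could feed a neural network.
    """
    img = img.astype(float)
    gy, gx = np.gradient(img)                 # per-pixel intensity gradients
    mag = np.hypot(gx, gy)                    # gradient strength
    ang = np.arctan2(gy, gx)                  # direction in [-pi, pi]
    bins = ((ang + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.zeros(n_bins)
    for b, m in zip(bins.ravel(), mag.ravel()):
        hist[b] += m                          # weight each bin by strength
    s = hist.sum()
    return hist / s if s > 0 else hist

# a toy 5x5 "stroke": a vertical bar of ink
demo = np.zeros((5, 5)); demo[:, 2] = 1.0
feat = direction_features(demo)
print(feat.shape)  # an 8-value feature vector, ready for a classifier
```

A fixed-length vector like this is what the back-propagation network described above would consume as input.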
The Devnagari script has 14 vowels and 33 consonants. Vowels occur either in
isolation or in combination with consonants. Apart from the vowels and
consonants, called basic characters, Devnagari also has compound characters,
which are formed by joining two or more basic characters. In addition,
Devnagari has twelve forms of modifiers for each of the 33 consonants, giving
rise to modified shapes that depend on whether the modifier is placed to the
left, right, top or bottom of the character. The net result is that there are
several thousand different shapes or patterns, which makes a Devnagari OCR
more difficult to develop. The focus here is on the recognition of offline
handwritten Hindi characters, which can be used in common applications such as
commercial forms, bill processing systems, bank cheques, government records,
signature verification, postcode recognition, passport readers and offline
document recognition generated by the expanding technological society. In this
project, Devnagari script characters are recognized from document images using
a template matching algorithm.

Table of Contents
1. Cover Page
2. Certificate
3. Abstract
4. Acknowledgements
5. Table of Contents
6. List of Tables
7. List of Figures
8. Introduction
9. Detailed review of work and problem statement
10. Proposed approach
11. Solution approach
12. Implementation and Result
13. Future work and conclusion
14. References

List of Tables
1. Recognition accuracy of handwritten Hindi characters ............... 25
2. Detailed recognition performance of SVM on UCI datasets ............ 45
3. Detailed recognition performance of SVM and HMM on UCI datasets .... 47
4. Recognition rate of each numeral in the datasets ................... 49

List of Figures
1. Hindi language basic character set ................................. 16
2. Character recognition of the document image ........................ 20
3. Output is saved in the form of the text format ..................... 25
4. Generated 8×8 input matrix ......................................... 28
5. Loading some entries from the digits dataset into the application,
   using the default values in the application ........................ 35
6. The analysis also performs rather well on completely new and
   previously unseen data ............................................. 40

INTRODUCTION
Handwriting recognition refers to the process of translating images of
handwritten, typewritten, or printed digits into a format understood by the
user for the purposes of editing, indexing/searching, and a reduction in
storage size. A handwriting recognition system has its own importance and is
applicable in various fields, such as online handwriting recognition on
computer tablets, recognizing zip codes on mail for postal sorting, processing
bank cheque amounts, numeric entries in forms filled in by hand, and so on.
There are two distinct handwriting recognition domains, online and offline,
which are differentiated by the nature of their input signals.

In an offline system, a static representation of a digitized document is used
in applications such as cheque, form, mail or document processing. On the
other hand, online handwriting recognition (OHR) systems rely on information
acquired during the production of the handwriting. They require specific
equipment that allows the capture of the trajectory of the writing tool.
Mobile communication systems such as the Personal Digital Assistant (PDA),
electronic pads and smartphones have an online handwriting recognition
interface integrated into them. Therefore, it is important to further improve
the recognition performance of these applications while trying to constrain
the space for parameter storage and improving processing speed. Figure 1
shows an online handwritten word recognition system. Many current systems use
a discrete Hidden Markov Model based recognizer or a hybrid of a Neural
Network (NN) and an HMM. After normalization, the writing is usually segmented
into basic units (normally a character or part of a character) and each
segment is classified and labeled. Using an HMM search algorithm in the
context of a language model, the most likely word path is then returned to the
user as the intended string.
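The HMM search step described above can be sketched as a Viterbi decode over per-segment class probabilities. The numbers and the two-state model below are toy illustrations, not values from the report:

```python
import numpy as np

def viterbi(obs_probs, trans, init):
    """Most likely state path given per-segment observation probabilities.

    obs_probs: (T, N) P(segment_t | state), e.g. from an NN or SVM classifier
    trans:     (N, N) state transition probabilities
    init:      (N,)   initial state probabilities
    """
    T, N = obs_probs.shape
    logd = np.log(init) + np.log(obs_probs[0])     # best log-prob per state
    back = np.zeros((T, N), dtype=int)             # backpointers
    for t in range(1, T):
        cand = logd[:, None] + np.log(trans)       # (from-state, to-state)
        back[t] = cand.argmax(axis=0)
        logd = cand.max(axis=0) + np.log(obs_probs[t])
    path = [int(logd.argmax())]
    for t in range(T - 1, 0, -1):                  # trace the path backwards
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# two states; the segment classifier is confident in state 0, then 1, then 1
obs = np.array([[0.9, 0.1], [0.2, 0.8], [0.3, 0.7]])
A = np.array([[0.6, 0.4], [0.3, 0.7]])
pi = np.array([0.5, 0.5])
print(viterbi(obs, A, pi))  # → [0, 1, 1]
```

In the hybrid systems discussed below, the per-segment probabilities come from the NN or SVM, while the transition matrix plays the role of the HMM/language model.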
The segmentation process can be performed in various ways. However, the
observation probability for each segment is normally obtained by using a
neural network (NN), and a Hidden Markov Model (HMM) estimates the
probabilities of transitions within a resulting word path. This research aims
to investigate the use of support vector machines (SVM) in place of the NN in
a hybrid SVM/HMM recognition system. The main objective is to further improve
the recognition rate [6, 7] by using the SVM at the segment classification
level. This is motivated by successful earlier work by Ganapathiraju on a
hybrid SVM/HMM speech recognition (SR) system and the work by Bahlmann [8] in
OHR. Ganapathiraju obtained a better recognition rate compared to a hybrid
NN/HMM SR system. In this work, an SVM is first developed and used to train an
OHR system using character databases. SVMs with probabilistic outputs are then
developed for use in the hybrid system. Eventually, the SVM will be integrated
with the HMM module for word recognition.

Preliminary results of using the SVM for character recognition are given and
compared with results using the NN reported by Poisson. The following
databases were used: IRONOFF, UNIPEN and the mixed IRONOFF-UNIPEN database.

Biometrics is most commonly defined as a measurable physiological or
behavioral characteristic of an individual that can be used in personal
identification and verification. A character recognition device is one such
smart device: it acquires partial human intelligence with the ability to
capture and recognize various characters in different languages. Character
recognition (in general, pattern recognition) addresses the problem of
classifying input data, represented as vectors, into categories; character
recognition is a part of pattern recognition [1].

It is impossible to achieve 100% accuracy. The most basic way to recognize
patterns is with probabilistic methods, in which we use Bayesian network
classifiers to recognize characters. The need for character recognition
software has increased greatly since the outstanding growth of the Internet.
Optical Character Recognition (OCR) is a very well-studied problem in the vast
area of pattern recognition. Its origins can be found as early as 1870, when
an image transmission system was invented which used an array of photocells to
recognize patterns.
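The simplest Bayesian-network classifier is naive Bayes, so the probabilistic approach mentioned above can be illustrated with a minimal Bernoulli naive Bayes over binary pixel vectors. The toy 2×2 "characters" below are illustrative, not data from the report:

```python
import numpy as np

class BernoulliNB:
    """Minimal naive Bayes over binary pixel vectors (a hedged sketch)."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        # per-class pixel-on probabilities, with Laplace smoothing so an
        # unseen pixel never gets probability exactly 0 or 1
        self.p = np.array([(X[y == c].sum(0) + 1) / (np.sum(y == c) + 2)
                           for c in self.classes])
        self.prior = np.array([np.mean(y == c) for c in self.classes])
        return self

    def predict(self, X):
        # log P(class) + sum of log P(pixel | class) for each pixel state
        logp = (np.log(self.prior)
                + X @ np.log(self.p).T
                + (1 - X) @ np.log(1 - self.p).T)
        return self.classes[logp.argmax(1)]

# toy 2x2 "characters": class 0 lights the left column, class 1 the right
X = np.array([[1, 0, 1, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 1, 0, 1]])
y = np.array([0, 0, 1, 1])
clf = BernoulliNB().fit(X, y)
print(clf.predict(np.array([[1, 0, 1, 0]])))  # → [0]
```

Real systems, as the surrounding text notes, cannot reach 100% accuracy; this sketch only shows the shape of the probabilistic decision rule.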
Until the middle of the 20th century, OCR was primarily developed as an aid to
the visually handicapped. With the advent of digital computers in the 1940s,
OCR was realized as a data processing approach for the first time. The first
commercial OCR systems began to appear in the early 1950s, and soon they were
being used by the US postal service to sort mail. The accurate recognition of
Latin-script typewritten text is now considered largely a solved problem in
applications where clear imaging is available, such as the scanning of printed
documents; typical accuracy rates exceed 99%, and total accuracy can only be
achieved by human review. Optical Character Recognition (OCR) programs are
capable of reading printed text. This could be text that was scanned from a
document, or handwritten text that was drawn on a hand-held device such as a
Personal Digital Assistant (PDA). The character recognition software breaks
the image into sub-images, each containing a single character.
The sub-images are then translated from an image format into a binary format,
where each 0 and 1 represents an individual pixel of the sub-image. The binary
data is then fed into a neural network that has been trained to make the
association between the character image data and a numeric value that
corresponds to the character. The output from the neural network is then
translated into ASCII text and saved as a file. Recognition of characters is a
very complex problem: the characters can be written in different sizes,
orientations, thicknesses, formats and dimensions, which gives infinite
variations.
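The image-to-binary conversion described above amounts to thresholding each sub-image and flattening it into a 0/1 vector. One hedged way to do this (the threshold value 128 is an assumption, not from the report):

```python
import numpy as np

def to_binary_vector(subimage, threshold=128):
    # dark pixels (ink) become 1, light background becomes 0,
    # and the 2-D sub-image is flattened into a 1-D input vector
    return (np.asarray(subimage) < threshold).astype(np.uint8).ravel()

glyph = np.array([[255, 10, 255],
                  [255, 12, 255],
                  [255,  8, 255]])          # a dark vertical stroke
print(to_binary_vector(glyph))  # → [0 1 0 0 1 0 0 1 0]
```

This flat vector is the form in which the binary data "is fed into a neural network" as the paragraph describes.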
The capability of a neural network to generalize and remain insensitive to
missing data [6, 7] is very beneficial in recognizing characters, so an
artificial neural network is used as the backend to solve the recognition
problem. Neural networks have been used in a variety of different areas to
solve a wide range of problems. Unlike human brains, which can identify and
memorize characters such as letters or digits, computers treat them as binary
graphics. The central objective of this paper is to demonstrate the
capabilities of artificial neural network implementations in recognizing
extended sets of image pixel data. In this paper, offline recognition of
characters is performed on a printed text document. It is a process by which
we convert a printed or scanned page into ASCII characters that a computer can
recognize. A back-propagation feed-forward neural network is used to recognize
the characters.
After training the network with the back-propagation learning algorithm, high
recognition accuracy can be achieved. Recognition of printed characters is
itself a challenging problem, since there are variations of the same character
due to changes of font or the introduction of different types of noise.
Differences in font and size make the recognition task difficult if the
pre-processing, feature extraction and recognition stages are not robust. This
paper is organized as follows: the multilayer perceptron neural network used
for recognition is briefly described in Section 2; the character recognition
procedure is described in Section 3; and Section 4 analyzes training
performance and prediction accuracy, together with the data description and
result analysis.
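The back-propagation training loop described above can be sketched in a few lines: a feed-forward pass through one hidden layer, followed by error propagation backwards through the weights. The layer sizes, learning rate and toy data below are illustrative assumptions, not values from the report:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)   # toy inputs
Y = np.array([[0], [1], [1], [0]], float)                # toy targets

W1 = rng.normal(0, 1, (2, 8)); b1 = np.zeros(8)          # hidden layer
W2 = rng.normal(0, 1, (8, 1)); b2 = np.zeros(1)          # output layer
sig = lambda z: 1 / (1 + np.exp(-z))                     # sigmoid activation

for _ in range(5000):
    h = sig(X @ W1 + b1)                  # feed-forward pass
    out = sig(h @ W2 + b2)
    d_out = (out - Y) * out * (1 - out)   # error at output layer
    d_h = (d_out @ W2.T) * h * (1 - h)    # error propagated backwards
    W2 -= 0.5 * h.T @ d_out; b2 -= 0.5 * d_out.sum(0)
    W1 -= 0.5 * X.T @ d_h;   b1 -= 0.5 * d_h.sum(0)

print(np.round(out.ravel()))              # predictions after training
```

In the character recognition setting, X would hold the pixel or direction feature vectors and Y the one-hot character labels; the update rule is the same.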
Hindi handwritten character recognition is one of the major problems in
today's world. Even typed Hindi characters are difficult for a computer to
recognize; handwritten Hindi characters are therefore not recognized
efficiently and accurately by machine. Much research has been done to
recognize these characters, many algorithms have been proposed, and many
software products for optical Hindi character recognition are on the market.
Recognizing characters requires many processing steps; no single process or
single machine can perform the recognition alone. Artificial neural networks
can be used for the recognition of characters due to the simplicity of their
design and their universality.

Hindi character recognition is becoming more and more important in the modern
world. It helps people ease their jobs and solve more complex problems. The
problem of recognition of hand-printed characters is still an active area of
research. With the increasing necessity for office automation, it is
imperative to provide practical and effective solutions. All sorts of
structural, topological and statistical information observed about the
characters does not lend a helping hand in the recognition process, due to the
different writing styles and moods of persons at the time of writing. Only
limited variations in the shapes of characters are considered here.

Literature Survey:-
Although the first research report on handwritten Devnagari characters was
published in 1977 [1], not much research work was done for a long time after
that. At present, researchers have started to work on handwritten Devnagari
characters, and a few research reports have been published recently. In this
paper, the implementation is done in MATLAB, which allows matrix
manipulations, plotting of functions and data, implementation of algorithms,
creation of user interfaces, and interfacing with programs written in other
languages, including C, C++, Java and Fortran. Hanmandlu and Murthy [2][3]
proposed a fuzzy-model-based recognition of handwritten Hindi numerals and
characters, and obtained 92.67% accuracy for handwritten Devnagari numerals
and 90.65% accuracy for handwritten Devnagari characters. Bajaj et al. [4]
employed three different kinds of features, namely density features, moment
features and descriptive component features, for the classification of
Devnagari numerals. They proposed a multi-classifier connectionist
architecture for increasing recognition reliability and obtained 89.6%
accuracy for handwritten Devnagari numerals. Kumar and Singh [5] proposed a
Zernike-moment-feature-based approach for Devnagari handwritten character
recognition, using an artificial neural network for classification.

OCR is one of the oldest ideas in the history of pattern recognition using
computers. In recent times, Punjabi character recognition has become a field
of practical use. In character recognition, the process starts with reading a
scanned image of a series of characters, determines their meaning, and finally
translates the image into a computer-written text document. This process is
commonly used in post offices, to mechanically read the names and addresses on
envelopes, and by banks, to read the amount and number on cheques. Companies
and civilians can also use this method to quickly translate paper documents
into computer-written documents. Much research has been done on character
recognition in the last 56 years. Some books [6-8] and many surveys [4, 5]
have been published on character recognition. Most of the work on character
recognition was done on Japanese, Latin and Chinese characters in the middle
of the 1960s. The work by Impedovo et al. [9] focuses on commercial OCR
systems. Jain et al. [10] summarized and compared some of the well-known
methods used in the various stages of a pattern recognition system; they have
tried to identify the research topics and applications which are at the
forefront of this field. Pal and Chaudhuri [8] in their report summarized
different systems for the recognition of Indian language scripts.
They described some commercial systems like the Bangla and Devnagari OCRs.
Manish [11] in his survey report summarized a system for the recognition of
Punjabi characters, and reported the scope of future work to be extended in
several directions, such as OCR for poor-quality documents, multi-font OCR,
and bi-script/multi-script OCR development. A bibliography of the fields of
OCR and document analysis is given in [12]. Tappet et al. [13] and Wakahara
et al. [14] worked on on-line handwriting recognition and described a
distortion-tolerant shape matching method. Noubound and Plamondon [15] and
Suen et al. [16] proposed methods for the on-line recognition of hand-printed
characters, while Connell et al. [17, 18] described on-line character
recognition for Devanagari and alphanumeric characters. Bortolozzi et al. [19]
have published a very useful study on recent advances in handwriting
recognition. Lee et al. [20] described off-line recognition of totally
unconstrained handwritten numerals using a multilayer cluster neural network.
The character regions are determined by using projection profiles and
topographic features extracted from the gray-scale images; then a nonlinear
character segmentation path in each character region is found by using a
multistage graph search algorithm. Khaly and Ahmed [21], Amin [22] and Lorigo
and Govindraju [23] have produced bibliographies of research on Arabic optical
text recognition. Hildebrandt and Liu [24] have reported the advances in
handwritten Chinese character recognition, and Liu et al. [25] have discussed
various techniques used for on-line Chinese character recognition.
2.1 Indian Script Recognition
Compared to the English and Chinese languages, research on the OCR of Indian
language scripts has not achieved the same perfection. A few attempts have
been made at the recognition of Indian character sets in Devanagari, Bangla,
Tamil, Telugu, Oriya, Gurmukhi, Gujarati and Kannada. These attempts are
briefly described in the following sub-sections.
2.1.1 Recognition of Handwritten Devnagari Scripts
Devnagari is the most popular script in India. The Devnagari script is used to
write many Indian languages, such as Hindi, Marathi, Rajasthani, Sanskrit and
Nepali. The characters of the Hindi language are shown in figure 9.

The work on handwritten Devnagari character recognition started in 1977, when
I. K. Sethi and B. Chatterjee [26] presented a system for handwritten
Devnagari characters. In this system, sets of very simple primitives were
used; most of the decisions were taken on the basis of the presence/absence or
positional relationship of these 18 primitives. A multistage process was used
for taking these decisions: on completion of each stage, the options for the
class membership of the input token decrease. In 1979, Sinha and Mahabala [27]
presented a syntactic pattern analysis system with an embedded picture
language for the recognition of handwritten and machine-printed Devnagari
characters; in this system, a feature extraction technique was mainly used.
Sethi and Chatterjee [28] also carried out some studies on hand-printed
Devnagari numerals based on a binary decision tree classifier, where the tree
was built on the basis of the presence or absence of some basic primitives,
namely horizontal line segments, vertical line segments, left and right
slants, D-curves, C-curves, etc., and their positions and interconnections;
that decision process was also multistage. Brijesh K. Verma [29] presented a
system for handwritten Hindi Character Recognition (HCR) using Multi-Layer
Perceptron (MLP) networks and Radial Basis Function (RBF) networks. The error
back-propagation algorithm was used to train the MLP networks.
2.1.2 Recognition of Bangla Characters
Among all the Indian scripts, the maximum work on the recognition of
handwritten characters has been done for Bangla. Handwritten Bangla characters
are shown in figure 10. Some OCR systems for offline handwritten Bangla
numerals and characters are available in the market. In 1982, S. K. Parui and
B. B. Chaudhuri et al. [30] proposed a recognition scheme for connected
handwritten Bangla numerals using a syntactic method, in which sub-patterns
are formed by automata from one-dimensional strings of eight-direction codes.
In 1998, A. F. R. Rahman and M. Kaykobad [31] proposed a complete Bangla OCR
system which used a hybrid approach for the recognition of handwritten Bangla
characters. Since everybody has a different writing style, Pal and Chaudhuri
[32] proposed a robust scheme for the recognition of isolated off-line
handwritten Bangla numerals; in this scheme the direction of the numeral, the
height and position of the numeral with respect to the character bounding box,
the shape of the reservoir, etc. are used for recognition. Dutta and Chaudhuri
[34] reported work on the recognition of isolated Bangla alphanumeric
handwritten characters using neural networks; in this method, primitives are
used to represent the characters, together with the structural constraints
between the primitives imposed by the junctions present in the characters. A
neural network approach is also used by Bhattacharya et al. [33] for the
recognition of Bangla handwritten numerals; here, certain features such as
loops, junctions, etc. present in the graph are considered to classify a
numeral into a smaller group. Sural and Das [35] defined fuzzy sets on the
Hough transform of character pattern pixels, from which additional fuzzy sets
are synthesized using t-norms. Garain et al. [36] proposed an online
handwriting recognition system for Bangla; a low-complexity classifier was
designed, and the proposed similarity measure appears to be quite robust
against wide variations in writing style.
Pal, Wakabayashi and F. Kimura [37] proposed a recognition system for offline
handwritten compound Bangla characters using the Modified Quadratic
Discriminant Function (MQDF). The features used for recognition are mainly
based on directional information obtained from the arc tangent of the
gradient. To obtain the features, a 2 × 2 mean filter is first applied 4 times
to the gray-level image, and non-linear size normalization is done on the image.
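The feature step just described can be sketched as follows: a 2×2 mean filter (applied once here, for brevity, rather than 4 times as in the cited work) followed by taking the arc tangent of the gradient at each pixel. This is an illustrative reconstruction, not the cited authors' code:

```python
import numpy as np

def mean_filter_2x2(img):
    """Average each 2x2 neighbourhood; output shrinks by one row/column."""
    img = img.astype(float)
    return (img[:-1, :-1] + img[:-1, 1:] + img[1:, :-1] + img[1:, 1:]) / 4.0

def gradient_directions(img):
    """Per-pixel gradient direction via arctan of the gradient components."""
    gy, gx = np.gradient(img.astype(float))
    return np.arctan2(gy, gx)            # direction in radians

img = np.zeros((6, 6)); img[:, 3:] = 200.0   # a vertical intensity edge
smooth = mean_filter_2x2(img)
dirs = gradient_directions(smooth)
print(smooth.shape, dirs.shape)  # (5, 5) (5, 5)
```

Binning these directions over image zones would yield the directional feature vector on which the MQDF classifier operates.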
2.1.3 Recognition of Tamil Characters
The work on the recognition of Tamil characters was started in 1978 by
Siromony et al. [38]. They described a method for the recognition of
machine-printed Tamil characters using an encoded character string dictionary.
The scheme employs string features extracted by row- and column-wise scanning
of the character matrix; features in each row (column) are encoded suitably
depending upon the complexity of the script to be recognized. Chandrasekaran
et al. [39] used a similar approach for constrained hand-printed Tamil
character recognition. Chinnuswamy and Krishnamoorthy [40] presented an
approach for hand-printed Tamil character recognition employing labeled graphs
to describe the structural composition of characters in terms of line-like
primitives; recognition is carried out by correlation matching of the labeled
graph of the unknown character with those of the prototypes.

A piece of work on on-line Tamil character recognition is reported by Aparna
et al. [41]. They used shape-based features including dots, line terminals,
bumps and cusps; stroke identification is done by comparing an unknown stroke
with a database of strokes. A finite state automaton has been used for
character recognition, with an accuracy of 71.32-91.5%.
2.1.4 Recognition of Telugu Characters
A two-stage recognition system for printed Telugu alphabets has been described
by Rajasekaran and Deekshatulu [42]. In the first stage, a directed curve
tracing method is employed to recognize primitives and to extract the basic
character from the actual character pattern. In the second stage, the basic
character is coded and, on the basis of the knowledge of the primitives and
the basic character present in the input pattern, classification is achieved
by means of a decision tree. Lakshmi and Patvardhan [43] presented a Telugu
OCR system for printed text of multiple sizes and multiple fonts. After
pre-processing, a connected component approach is used for segmenting the
characters, and real-valued direction features are used for a neural network
based recognition system; the authors have claimed an accuracy of 98.6%. Negi
et al. [2] presented a system for printed Telugu character recognition using
connected components and fringe-distance-based template matching, where fringe
distances compare only the black pixels and their positions between the
templates and the input images.
2.1.5 Recognition of Gurmukhi Characters
The Gurmukhi script is used primarily for writing the Punjabi language.
Punjabi is spoken by eighty-four million native speakers and is the world's
14th most widely spoken language. Lehal and Singh [30] developed a complete
OCR system for printed Gurmukhi script in which connected components are first
segmented using a thinning-based approach; they started by discussing useful
pre-processing techniques. Lehal and Singh [30] have discussed the
segmentation problems of Gurmukhi script in detail. They observed that the
horizontal projection method, the most commonly used method for extracting the
lines from a document, fails in many cases when applied to Gurmukhi text and
results in over-segmentation or under-segmentation. The text image is broken
into horizontal text strips using the horizontal projection of each row, with
the gaps in the horizontal projection profile taken as separators between the
text strips. Each text strip can represent: a) the core zone of one text line,
consisting of the upper and middle zones and optionally the lower zone (core
strip); b) the upper zone of a text line (upper strip); c) the lower zone of a
text line (lower strip); d) the core zone of more than one text line (multi
strip). Then, using the estimated average height of the core strip and its
percentage, they identify the type of each strip. The classification process
is carried out in three stages. In the first stage, the characters are grouped
into three sets depending on their zonal position, i.e., upper zone, middle
zone and lower zone. In the second stage, the characters in the middle-zone
set are further distributed into smaller subsets by a binary decision tree
using a set of robust and font-independent features. In the third stage, the
nearest neighbor classifier is used with special features that distinguish the
characters in each subset; this enhances the computational efficiency. The
system has an accuracy of about 97.34%. An OCR post-processor for Gurmukhi
script was also developed: finally, Lehal and Singh and Lehal et al. proposed
a post-processor for Gurmukhi OCR in which statistical information on Punjabi
syllable combinations and certain heuristics based on Punjabi grammar rules
are considered. There is also some literature dealing with the segmentation of
Gurmukhi script. Lehal and Singh performed segmentation of Gurmukhi script by
connected component analysis of a word, assuming the headline is not a part of
the word. Goyal et al. suggested a dissection-based Gurmukhi character
segmentation method, which segments the characters in the different zones of a
word by examining the vertical white space. Manish [11] proposed an algorithm
for recognizing Gurmukhi script; in his work he recognized Punjabi characters
with an efficiency of 92.56%, whereas for Chinese and Latin the recognition
efficiency for words is over 99%.
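The horizontal projection method discussed above can be sketched directly: sum the ink pixels in each row, and treat runs of zero-sum rows as the gaps that separate text strips. A minimal illustration (the toy image is an assumption, not Gurmukhi data):

```python
import numpy as np

def text_strips(binary_img):
    """Return (start_row, end_row) for each run of rows containing ink."""
    profile = binary_img.sum(axis=1)          # horizontal projection profile
    strips, start = [], None
    for r, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = r                         # a strip begins
        elif ink == 0 and start is not None:
            strips.append((start, r - 1))     # a gap ends the strip
            start = None
    if start is not None:                     # strip touching the bottom edge
        strips.append((start, len(profile) - 1))
    return strips

img = np.zeros((8, 10), dtype=int)
img[1:3] = 1      # first text strip: rows 1-2
img[5:7] = 1      # second text strip: rows 5-6
print(text_strips(img))  # → [(1, 2), (5, 6)]
```

The failure mode Lehal and Singh observed corresponds to strips that merge (no zero-sum gap between lines, under-segmentation) or split (a gap inside one line, over-segmentation).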

HINDI LANGUAGE: A REVIEW :-
Hindi is an Indo-Aryan language and one of the official languages of India. It
is the world's third most commonly used language after Chinese and English,
with approximately 500 million speakers all over the world. It is written in
the Devnagari script, from left to right along a horizontal line. The basic
character set has 13 SWARS (vowels) and 33 VYANJANS (consonants), shown in the
figure.

Figure 1: Hindi language basic character set

DEVNAGARI SCRIPT :-
Hindi is the world's third most commonly used language after Chinese and
English, and approximately 500 million people all over the world speak and
write in Hindi. Devnagari is the basic script of many languages of India, such
as Hindi and Sanskrit. It is indisputable that Devnagari has a most accurate
scientific basis. For a long time it has been the script of the Indo-Aryan
languages, and it is still used by the Sanskrit, Hindi, Marathi and Nepali
languages. Since Hindi is so widely spoken and its script is Devnagari,
Devnagari is a very popular script, and as Hindi has been declared the
national language by the Constitution of India, Devnagari has the status of a
national script.

In the beginning, Hindi was declared the state language, and Devnagari the
state script, of major states such as Himachal, Haryana, Rajasthan, Madhya
Pradesh, Bihar, Uttaranchal, etc. Devnagari is considered a very scientific
script: since every Indian script developed from the Brahmi script, Devnagari
has connections with almost every other script. In Devnagari all letters are
equal, i.e. there is no concept of capital or small letters, and the script is
half-syllabic in nature.

OPTICAL CHARACTER RECOGNITION (OCR)
OCR is the acronym for Optical Character Recognition. This technology allows a machine
to recognize characters automatically through an optical mechanism. Human beings
recognize many objects in this manner: the eyes are the optical mechanism, while the
brain interprets the input, and the ability to interpret these signals varies from
person to person depending on many factors. Reviewing these variables makes it easy to
understand the challenges faced by the technologist developing an OCR system.
Documents on paper can be read and understood by humans, but a computer cannot
understand these documents directly. OCR systems are developed to convert such
documents into a computer-processable form. OCR is the process of converting scanned
images of machine-printed or handwritten text, numerals, letters and symbols into a
computer-processable format such as ASCII. OCR is an area of pattern recognition, and
work on handwritten character processing is motivated largely by the desire to improve
communication between man and machine.

PROPOSED ALGORITHM
The system performs character recognition by exploiting template matching for its
ability to recognize handwritten Hindi characters. The following steps are followed:
1. A database of handwritten Hindi characters is created from the handwriting of
   different people.
2. Preprocessing of each training image:
   a) Binarization of the image using bw = im2bw(Ibw, level).
   b) Edge detection using iedge = edge(uint8(bw)).
   c) Dilation of the image using se = strel('square', 2); iedge2 = imdilate(iedge, se).
   d) Region filling using ifill = imfill(iedge2, 'holes').
   e) Character detection in the image using
      [Ilabel, num] = bwlabel(ifill);
      Iprops = regionprops(Ilabel);
      Ibox = [Iprops.BoundingBox];
      Ibox = reshape(Ibox, [4 num]);
3. Extraction and scaling of the normalized characters to a 50x50 grid using boundary
   value analysis:
   img{cnt} = imcrop(Ibw, Ibox(:, cnt)); bw2 = imgcrop(img{cnt}); charvec =
   imresize(bw2, [50 50]);
4. Template generation using image averaging; the templates are saved in a
   templates.mat file, which is used in the matching phase.
5. The test image is binarized and matched against the templates, and a result.txt
   file containing the recognized characters is generated.
The scope of the proposed system is limited to the recognition of a single character.
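Steps 4 and 5 can be sketched outside MATLAB as well. The following is an
illustrative Python sketch of template generation by averaging and matching by sum of
squared differences; all names and the toy 2x2 "characters" are hypothetical, not the
report's actual implementation.

```python
# Illustrative sketch: build a template per class by averaging binarized
# samples, then assign a test image to the nearest template by SSD.

def make_template(samples):
    """Average several same-sized binary images (lists of lists of 0/1)."""
    h, w = len(samples[0]), len(samples[0][0])
    n = len(samples)
    return [[sum(s[i][j] for s in samples) / n for j in range(w)]
            for i in range(h)]

def match(image, templates):
    """Return the label whose template has the smallest sum of squared differences."""
    def ssd(a, b):
        return sum((a[i][j] - b[i][j]) ** 2
                   for i in range(len(a)) for j in range(len(a[0])))
    return min(templates, key=lambda label: ssd(image, templates[label]))

# Toy 2x2 "characters": two samples per (hypothetical) class.
ka_samples = [[[1, 0], [1, 0]], [[1, 0], [1, 1]]]
kha_samples = [[[0, 1], [0, 1]], [[1, 1], [0, 1]]]
templates = {"KA": make_template(ka_samples), "KHA": make_template(kha_samples)}
print(match([[1, 0], [1, 0]], templates))  # KA: closest to the averaged KA template
```

In the real system the same idea operates on the 50x50 normalized character images
produced in step 3.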

SCANNING
Handwritten character data samples are acquired on paper from various people. These
samples are then digitized with an optical device such as a scanner or a camera. A
flat-bed scanner is used at 300 dpi, which converts the data on the scanned paper into
a bitmap image.

DETAIL REVIEW OF WORK AND PROBLEM STATEMENT


DESIGNING OF MULTILAYER NEURAL NETWORK FOR RECOGNITION
There are two basic methods used for OCR: matrix matching and feature extraction.
Matrix matching is the simpler and more common of the two, but feature extraction is
more versatile and is used here to make the system more robust and accurate. The
process of character recognition of the document image mainly involves six phases:
1. Acquisition of grayscale image
2. Digitization/binarization
3. Line and boundary detection
4. Feature extraction
5. Feed-forward artificial neural network based matching
6. Recognition of character based on matching score

Fig. 2.

The scanned image must be [4, 5] a grayscale image or a binary image, where a binary
image is a contrast-stretched grayscale image. The grayscale image then undergoes
digitization. In digitization [12] a rectangular matrix of 0s and 1s is formed from
the image, where 0 is black and 1 is white; all RGB values are converted into 0s and
1s. This matrix of dots represents a two-dimensional array of bits. Digitization is
also called binarization, as it converts a grayscale image into a binary image using
an adaptive threshold. Line and boundary detection is the process of identifying
points in the digital image from which the top, bottom, left and right extents of the
character are calculated. The feed-forward neural network approach is used to combine
all the unique features, which are taken as inputs; one hidden layer is used to
integrate and collaborate [9] similar features and, if required, adjust the inputs by
adding or subtracting weight values; finally, one output layer is used to find the
overall matching score.

CHARACTER RECOGNITION PROCEDURE

Pre-processing: The pre-processing stage yields a clean document, in the sense that
maximal shape information is obtained with maximal compression and minimal noise on
the normalized image.
Segmentation: Segmentation is an important stage, because the extent to which words,
lines and characters can be separated directly affects the recognition rate of the
script.
Feature extraction: After segmenting the characters, features such as height, width,
horizontal lines, vertical lines, and top and bottom positions are extracted.
Classification: For classification, or recognition, the back-propagation algorithm is
used.
Output: The output is saved in text format.

TRAINING ALGORITHM PERFORMANCE AND ACCURACY OF PREDICTION
The back-propagation algorithm requires a numerical representation of the characters.
Learning is implemented using the back-propagation algorithm with a learning rate. The
gradient is calculated [10] after every iteration and compared with a threshold
gradient value; if the gradient is greater than the threshold, the next iteration is
performed. The batch steepest-descent training function is used: the weights and
biases are updated in the direction of the negative gradient of the performance
function. To evaluate the model quantitatively, two error measures are employed for
evaluation and model comparison: the mean squared error (MSE) and the mean absolute
error (MAE). If y_t is the actual observation for time period t and F_t is the
forecast for the same period, the error is defined as

    e_t = y_t - F_t                          (1)

The standard statistical error measures can then be defined as

    MSE = (1/n) * sum(e_t^2, t = 1..n)       (2)

and the mean absolute error as

    MAE = (1/n) * sum(|e_t|, t = 1..n)       (3)

where n is the number of time periods. When the mean squared error decreased gradually
and became stable, the training and testing errors produced satisfactory results, as
shown by the training performance curve of the neural network.
The accuracy of the trained network is tested against output data in two ways. In the
first, the predicted output values are compared with the measured values; the results
show the relative accuracy of the predicted output, and the overall percentage error
obtained from the tested results is 4%. In the second, the root mean squared error and
the mean absolute error are determined and compared. The performance index for
training the ANN is given in terms of the mean squared error (MSE). The tolerance
limit for the MSE is set to 0.001. The MSE of the training set becomes stable at
0.0070 when the number of iterations reaches 350.
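The MSE and MAE measures used above can be computed directly; a minimal Python sketch
with made-up illustration data:

```python
# Direct computation of the two error measures: MSE = mean of squared errors,
# MAE = mean of absolute errors. The actual/forecast values are illustration data.

def mse(actual, forecast):
    errors = [y - f for y, f in zip(actual, forecast)]
    return sum(e * e for e in errors) / len(errors)

def mae(actual, forecast):
    errors = [y - f for y, f in zip(actual, forecast)]
    return sum(abs(e) for e in errors) / len(errors)

actual = [1.0, 0.0, 1.0, 1.0]
forecast = [0.9, 0.2, 1.0, 0.7]
print(mse(actual, forecast))  # ~0.035 = (0.01 + 0.04 + 0.0 + 0.09) / 4
print(mae(actual, forecast))  # ~0.15  = (0.1 + 0.2 + 0.0 + 0.3) / 4
```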
The closeness of the training and testing errors validates the accuracy of the model.

EXPERIMENTAL RESULTS
An interface for the proposed character recognition system was created using Microsoft
Visual C# 2008 Express Edition. The MLP network that is implemented is composed of
three layers: an input layer, a hidden layer and an output layer. The input layer
consists of 180 neurons, which receive printed image data from a 30x20 symbol pixel
matrix. The hidden layer consists of 256 neurons, a number [12] decided on the basis
of optimal results by trial and error. The output layer is composed of 16 neurons.
Number of characters = 90, learning rate = 150, number of neurons in the hidden
layer = 256.

TABLE I: PERCENTAGE OF ERROR FOR DIFFERENT EPOCHS

2. Existing Techniques
2.1 Modified Quadratic Discriminant Function (MQDF) Classifier
G. S. Lehal and Nivedan Bhatt [10] designed a recognition system for handwritten
Devanagari numerals using a modified quadratic discriminant function (MQDF)
classifier. A recognition rate of 89% and a confusion rate of 4.5% were obtained.

2.2 Neural Network on Devanagari Numerals


R. Bajaj, L. Dey and S. Chaudhari [11] used a neural network based classification scheme.
Numerals were represented by feature vectors of three types. Three different neural
classifiers had been used for classification of these numerals. Finally, the outputs of
the three classifiers were combined using a connectionist scheme. A 3-layer MLP was
used for implementing the classifier for segment-based features. Their work produced
recognition rate of 89.68%.

2.3 Gaussian Distribution Function

R. J. Ramteke et al. applied classifiers to 2000 numeral images obtained from
individuals of different professions. The results of PCA, correlation coefficient and
perturbed moments were an experimental success compared with MIs. This research
produced a 92.28% recognition rate using 77 feature dimensions.

2.4 Fuzzy classifier on Hindi Numerals


M. Hanmandlu, A.V. Nath, A.C. Mishra and V.K. Madasu used a fuzzy membership function
for recognition of handwritten Hindi numerals and produced a 96% recognition rate. To
recognize the unknown numeral set, an exponential variant of the fuzzy membership
function was selected, constructed using the normalized vector distance.
2.5 Multilayer Perceptron
Ujjwal Bhattacharya, B. B. Chaudhuri [11] used a distinct MLP classifier. They worked on
Devanagari, Bengali and English handwritten numerals. A back propagation (BP)
algorithm was used for training the MLP classifiers. It provided 99.27% and 99.04%
recognition accuracies on the original training and test sets of Devanagari numeral
database, respectively.
2.6 Quadratic classifier for Devanagari Numerals
U. Pal, T. Wakabayashi, N. Sharma and F. Kimura [14] developed a modified quadratic
classifier for recognition of offline handwritten numerals of six popular Indian
scripts. They used 64-dimensional features for high-speed recognition. A five-fold
cross-validation technique was used for result computation, obtaining 99.56% accuracy
on Devnagari script.

PROPOSED APPROACH
3.1 Support Vector Machine (SVM)
SVM in its basic form implements two-class classification. It has been used in recent
years as an alternative to popular methods such as neural networks. The advantage of
SVM is that it takes into account both experimental data and structural behavior for
better generalization capability, based on the principle of structural risk
minimization (SRM). Its formulation approximates the SRM principle by maximizing the
margin of class separation, which is why it is also known as a large-margin
classifier. The basic SVM formulation is for linearly separable datasets. It can be
used for nonlinear datasets by indirectly mapping the nonlinear inputs into a linear
feature space, where the maximum-margin decision function is approximated; the mapping
is done using a kernel function. Multi-class classification can be performed by
modifying the two-class scheme. The objective of recognition is to interpret a
sequence of numerals taken from the test set. The architecture of the proposed system
is given in fig. 3. The SVM (a binary classifier) is applied to the multi-class
numeral recognition problem using the one-versus-rest method. The SVM is trained on
the training samples using a linear kernel.
The classifier performs its function in two phases: training and testing. [29] After
the preprocessing and feature extraction steps, training is performed on the feature
vectors, which are stored in the form of matrices. The result of training is used for
testing the numerals. The training procedure is given in the algorithm.
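The one-versus-rest scheme mentioned above can be sketched with plain linear decision
functions f_k(x) = w_k . x + b_k. The per-class weights below are made-up for
illustration, not produced by an actual SVM solver.

```python
# One-versus-rest reduction: each class gets its own binary scorer, and the
# class whose scorer gives the highest value wins.

def decision(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict_one_vs_rest(classifiers, x):
    """Pick the label whose (w, b) scorer assigns x the highest score."""
    return max(classifiers, key=lambda label: decision(*classifiers[label], x))

# Hypothetical per-class (w, b) pairs for three numeral classes.
classifiers = {
    "0": ([1.0, -0.5], 0.0),
    "1": ([-0.5, 1.0], 0.1),
    "2": ([0.2, 0.2], -0.3),
}
print(predict_one_vs_rest(classifiers, [0.9, 0.1]))  # class "0" scores highest
```

With a trained SVM per class, the same argmax-over-scores rule turns the binary
classifiers into a multi-class numeral recognizer.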
3.2 Statistical Learning Theory
Support Vector Machines have been developed by Vapnik in the framework of Statistical
Learning Theory [13]. In statistical learning theory (SLT), the problem of classification in
supervised learning is formulated as follows. We are given a set of l training data
points and their classes, {(x1,y1), ..., (xl,yl)} in R^n x R, sampled according to an
unknown joint probability distribution P(x,y) characterizing how the classes are
spread in R^n x R. To measure the performance of a classifier, a loss function
L(y,f(x)) is defined: L(y,f(x)) is zero if f classifies x correctly, and one
otherwise. On average, how f performs can be described by the risk functional

    R(f) = integral of L(y, f(x)) dP(x,y)

which in practice is approximated by the empirical risk

    Remp(f) = (1/l) * sum(L(yi, f(xi)), i = 1..l)

The ERM principle states that, given the training set and a set of possible
classifiers in the hypothesis space F, we should choose the f in F that minimizes
Remp(f). However, such an f may not generalize well to unseen data, due to the
overfitting phenomenon: Remp(f) is a poor, over-optimistic approximation of R(f), the
true risk. The neural network classifier relies on the ERM principle.
The normal practice to get a more realistic estimate of the generalization error, as
in neural networks, is to divide the available data into a training set and a test
set. The training set is used to find a classifier with minimal empirical error (e.g.
to optimize the weights of an MLP neural network), while the test set is used to
estimate the generalization error (the error rate on the test set). If we have
different classifier hypothesis spaces F1, F2, ..., e.g. MLP neural networks with
different topologies, we can select a classifier from each hypothesis space (each
topology) with minimal Remp(f) and choose the final classifier with minimal
generalization error. However, doing so requires designing and training a potentially
large number of individual classifiers. Using SLT, we do not need to do that: the
generalization error can be minimized directly by minimizing an upper bound on the
risk functional R(f). The bound below holds for any distribution P(x,y) with
probability at least 1 - delta:

    R(f) <= Remp(f) + Omega(h, l, delta)

where the parameter h denotes the so-called VC (Vapnik-Chervonenkis) dimension and
Omega is the confidence term defined by Vapnik [10]. ERM alone is not sufficient to
find a good classifier, because even with small Remp(f), when h is large compared to
l, Omega will be large, so R(f) will also be large, i.e. not optimal. We actually need
to minimize Remp(f) and Omega at the same time, a process which is called structural
risk minimization (SRM). With SRM, we no longer need a test set for model selection:
taking different sets of classifiers F1, F2, ... with known h1, h2, ..., we can select
f from each set with minimal Remp(f), compute the bound, and choose the classifier
with minimal R(f). No more evaluation on a test set is needed, at least in theory.
However, we would still have to train a potentially very large number of individual
classifiers. To avoid this, we want to make h tunable, i.e. to associate each
candidate classifier set Fi with a VC dimension h and choose an optimal f from an
optimal Fi in a single optimization step. This is done in large-margin classification.
3.3 SVM formulations
SVM is realized from the above SLT framework. The simplest formulation of SVM is
linear, where the decision hyperplane lies in the space of the input data x. In this
case the hypothesis space is a subset of all hyperplanes of the form f(x) = w.x + b.
SVM finds the optimal hyperplane as the solution to the learning problem, namely the
hyperplane geometrically furthest from both classes, since that will generalize best
for future unseen data. There are two ways of finding the optimal decision hyperplane.
The first is to find the plane that bisects the two closest points of the two convex
hulls defined by the points of each class, as shown in figure 2. The second is to
maximize the margin between two supporting planes, as shown in figure 3. Both methods
produce the same optimal decision plane and the same set of points supporting the
solution (the closest points on the two convex hulls in figure 2, or the points on the
two parallel supporting planes in figure 3). These points are called the support
vectors.
4. Feature Extraction
4.1 Moment Invariants
The moment invariants (MIs) [1] are used to evaluate seven distributed parameters of a
numeral image. In any character recognition system, the characters are processed to
extract features that uniquely represent properties of the character. Based on
normalized central moments, a set of seven moment invariants is derived. The resultant
image was then thinned and seven further moments were extracted from it. Thus we had
14 features (7 original and 7 thinned), which are used as features for recognition
with a Gaussian distribution function. To increase the success rate, new features need
to be extracted by applying the affine invariant moment method.
4.2 Affine Moment Invariants
The affine moment invariants were derived by means of the theory of algebraic
invariants; the full derivation and a comprehensive discussion of the properties of
the invariants can be found in the literature. Four such features can be computed for
character recognition, so overall 18 features have been used for the Support Vector
Machine.
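The moment invariants of section 4.1 are built from normalized central moments. As a
minimal sketch (not the report's implementation), the following computes the first Hu
invariant, phi1 = eta20 + eta02, for a small binary image and shows that it is
unchanged when the shape is translated:

```python
# Normalized central moments eta_pq = mu_pq / mu00^((p+q)/2 + 1) and the first
# Hu moment invariant phi1 = eta20 + eta02 for a binary image.

def raw_moment(img, p, q):
    return sum((x ** p) * (y ** q) * img[y][x]
               for y in range(len(img)) for x in range(len(img[0])))

def central_moment(img, p, q):
    m00 = raw_moment(img, 0, 0)
    xbar = raw_moment(img, 1, 0) / m00
    ybar = raw_moment(img, 0, 1) / m00
    return sum(((x - xbar) ** p) * ((y - ybar) ** q) * img[y][x]
               for y in range(len(img)) for x in range(len(img[0])))

def phi1(img):
    """First Hu moment invariant: eta20 + eta02 (for p+q=2 the exponent is 2)."""
    mu00 = central_moment(img, 0, 0)
    eta20 = central_moment(img, 2, 0) / mu00 ** 2
    eta02 = central_moment(img, 0, 2) / mu00 ** 2
    return eta20 + eta02

shape = [[0, 1, 1, 0],
         [0, 1, 1, 0],
         [0, 0, 0, 0]]
shifted = [[0, 0, 0, 0],
           [0, 0, 1, 1],
           [0, 0, 1, 1]]
print(abs(phi1(shape) - phi1(shifted)) < 1e-12)  # True: invariant under translation
```

The remaining six Hu invariants, and the affine invariants of section 4.2, are built
from the same eta_pq quantities.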

5. Experiment
5.1 Data Set Description
In this work, datasets from the UCI Machine Learning Repository are used. The UCI
Machine Learning Repository is a collection of databases, domain theories and data
generators used by the machine learning community for the empirical analysis of
machine learning algorithms. One of the available datasets is the Optical Recognition
of Handwritten Digits data set.
A dataset of handwritten Assamese characters was also created by collecting samples
from 45 writers. Each writer contributed 52 basic characters, 10 numerals and 121
Assamese conjunct consonants, so the total number of entries per writer is 183 (= 52
characters + 10 numerals + 121 conjunct consonants) and the total number of samples in
the dataset is 8235 (= 45 x 183). The handwriting samples were collected on an iball
8060U external digitizing tablet connected to a laptop, using its cordless digital
stylus pen. The dataset is distributed across 45 folders. Each file contains
information about the character id (ID), character name (Label) and actual shape of
the character (Char).
In the raw Optdigits data, digits are represented as 32x32 matrices. They are also
available in a preprocessed form in which the digits have been divided into
non-overlapping 4x4 blocks and the number of "on" pixels counted in each block. This
generates 8x8 input matrices in which each element is an integer in the range 0 to 16.
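The 32x32 to 8x8 reduction described above can be sketched in a few lines of Python:
count the "on" pixels in each non-overlapping 4x4 block.

```python
# Reduce a binary bitmap to per-block on-pixel counts, as in the preprocessed
# Optdigits data: 32x32 input, 4x4 blocks, 8x8 output with values 0..16.

def block_counts(bitmap, block=4):
    """Count on-pixels in each non-overlapping block x block tile."""
    size = len(bitmap) // block
    return [[sum(bitmap[by * block + i][bx * block + j]
                 for i in range(block) for j in range(block))
             for bx in range(size)]
            for by in range(size)]

# A 32x32 test bitmap with every pixel on: every 4x4 block counts 16.
bitmap = [[1] * 32 for _ in range(32)]
counts = block_counts(bitmap)
print(len(counts), len(counts[0]), counts[0][0])  # 8 8 16
```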

5.2 Data Preprocessing


For the experiments using SVM, example isolated characters are preprocessed and 7
local features are extracted for each point of the spatially resampled online signal.
For each example character there are 350 feature values as input to the SVM. We use an
SVM with an RBF kernel, since the RBF kernel has been shown generally to give better
recognition results. A grid search was performed to find the best values of C and
gamma (as they appear in the original RBF kernel formulation) for the final SVM
models; C = 8 and a corresponding gamma were chosen.

PREPROCESSING
A series of operations is performed on the scanned image during preprocessing
(figure 4):
(i) Median filtering is applied to reduce the noise introduced to the character image
during scanning. The filter is usually taken from a template centered on the point of
interest: to perform median filtering at a point, the values of the pixel and its
neighbors are sorted by gray level and their median is determined [12].
(ii) Global thresholding is applied to convert the image from grayscale to binary
form.
(iii) The image is normalized to 7x7.
(iv) Thinning is performed by the method proposed in [10].
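Steps (i) and (ii) above can be sketched as a 3x3 median filter followed by global
thresholding of a small grayscale image (values 0 to 255); this is an illustrative
sketch, not the report's implementation.

```python
# (i) Median filter: replace each interior pixel with the median of its 3x3
# neighborhood, which removes isolated noise specks.
def median3x3(img):
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = sorted(img[y + dy][x + dx]
                            for dy in (-1, 0, 1) for dx in (-1, 0, 1))
            out[y][x] = window[4]  # median of 9 values
    return out

# (ii) Global threshold: grayscale to binary with a single fixed level.
def threshold(img, level=128):
    return [[1 if v >= level else 0 for v in row] for row in img]

# A mostly dark image with one bright noise pixel in the middle.
img = [[10, 10, 10],
       [10, 250, 10],
       [10, 10, 10]]
print(median3x3(img)[1][1])  # 10: the noise speck is removed before thresholding
print(threshold(img)[1][1])  # 1: without filtering, the speck survives thresholding
```

Sorting by gray level and taking the middle value is exactly the procedure described
in step (i).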

The back-propagation algorithm works as follows:
(i) All the weights are initialized to small random values.
(ii) An input vector and the desired output vector are presented to the net.
(iii) Each input unit receives an input signal and transmits it to each hidden unit.
(iv) Each hidden unit calculates its activation function and sends the signal to each
output unit.
(v) The actual output is calculated. Each output unit compares the actual output with
the desired output to determine the error associated with that unit.
(vi) The weights are adjusted to minimize the error.
In this work, the proposed back-propagation neural net is designed with two hidden
layers, as shown in figure 7. The input layer contains 49 nodes, as 49 features were
extracted for each character. The output layer contains 5 nodes (1 node for each
class); each character is represented as a 5x1 output vector. The number of nodes in
both hidden layers was set to 7.
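Step (vi) can be illustrated numerically with a single sigmoid output unit: one
gradient-descent update on made-up data, showing that the squared error decreases.
This is a toy sketch, not the report's 49-7-7-5 network.

```python
# One weight update for a single sigmoid unit trained with squared error.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = [0.5, 0.8]     # input features (illustration data)
w = [0.1, -0.2]    # initial weights
target = 1.0       # desired output
lr = 0.5           # learning rate

def forward(w):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def error(w):
    return 0.5 * (target - forward(w)) ** 2

before = error(w)
out = forward(w)
delta = (out - target) * out * (1 - out)   # dE/dz for sigmoid + squared error
w = [wi - lr * delta * xi for wi, xi in zip(w, x)]  # step (vi): adjust weights
after = error(w)
print(after < before)  # True: the update reduced the error
```

In the full network the same delta rule is applied layer by layer, propagating the
error backwards from the 5 output nodes through the two hidden layers.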

PROPOSED WORK
Character recognition has been attempted through many different approaches, such as
template matching and statistical techniques like NN, HMM and the quadratic
discriminant function (QDF). Template matching works effectively for recognition of
standard fonts, but gives poor performance with handwritten characters and as the
dataset grows; it is not an effective technique when fonts vary. HMM models achieved
great success in the field of speech recognition in past decades, but developing a 2-D
HMM model for character recognition has been found difficult and complex. NN is found
to be computationally expensive for recognition. N. Araki et al. applied Bayesian
filters based on Bayes' theorem to handwritten character recognition. Later,
discriminative classifiers such as the Artificial Neural Network (ANN) and the Support
Vector Machine (SVM) attracted a lot of attention. G. Vamvakas et al. compared the
performance of three classifiers, Naive Bayes, K-NN and SVM, and attained the best
performance with SVM. However, SVM suffers from the limitation of kernel selection.
ANNs can adapt to changes in the data and learn the characteristics of the input
signal; ANNs also consume less storage and computation than SVMs. The most used
ANN-based classifiers are the MLP and the RBFN. B.K. Verma presented a system for HCR
using MLP and RBFN networks in the task of handwritten Hindi character recognition;
the error back-propagation algorithm was used to train the MLP networks. J. Sutha et
al. showed the effectiveness of the MLP for Tamil HCR using Fourier descriptor
features. R. Gheroie et al. proposed handwritten Farsi character recognition using an
MLP trained with the error back-propagation algorithm. Similarly shaped characters are
difficult to differentiate because of very minor variations in their structures. T.
Wakabayashi et al. proposed an F-Ratio (Fisher ratio) based feature extraction method
to improve results on similarly shaped characters. They considered pairs of similarly
shaped characters of different scripts, such as English, Arabic/Persian and Devnagri,
and used QDF for recognition. QDF suffers from the limitation of a minimum required
dataset size. F. Yang et al. [14] proposed a method that combines both structural and
statistical features of characters for similar handwritten Chinese character
recognition.
As various feature extraction methods and classifiers have been used for character
recognition by researchers as suits their work, we propose a novel feature set that is
expected to perform well for this application. In this work, features are extracted on
the basis of character geometry, and are then fed to each of the selected ML
algorithms for recognition of SSHMC.

3. Methodology for Feature Extraction
A device is to be designed and trained to recognize the 26 letters of the alphabet. We
assume that some imaging system digitizes each letter centered in the system's field
of vision, so that each letter is represented as a 5 by 7 grid of real values. The
following figure shows perfect pictures of all 26 letters.
Figure 1: The 26 letters of the alphabet with a resolution of 5 x 7.
However, the imaging system is not perfect, and the letters may suffer from noise:
Figure 2: A perfect picture of the letter A and 4 noisy versions (standard deviation
of 0.2).
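Noisy versions like those in Figure 2 can be generated by perturbing the 5 by 7 grid
with Gaussian noise of standard deviation 0.2; the crude "A" below is a hypothetical
stand-in for the actual letter bitmaps.

```python
# Add Gaussian noise (sigma = 0.2) to a letter grid; seeded for repeatability.
import random

def noisy(grid, sigma=0.2, seed=0):
    rng = random.Random(seed)
    return [[v + rng.gauss(0.0, sigma) for v in row] for row in grid]

# A crude 5x7 "A" as a 7-row, 5-column grid of 0/1 values (illustration only).
letter_a = [[0, 0, 1, 0, 0],
            [0, 1, 0, 1, 0],
            [0, 1, 0, 1, 0],
            [1, 0, 0, 0, 1],
            [1, 1, 1, 1, 1],
            [1, 0, 0, 0, 1],
            [1, 0, 0, 0, 1]]
version = noisy(letter_a)
print(len(version), len(version[0]))  # 7 5: same grid shape, now real-valued
```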

Perfect classification of ideal input vectors is required and, more importantly,
reasonably accurate classification of noisy vectors. Before OCR can be used, the
source material must be scanned using an optical scanner (and sometimes a specialized
circuit board in the PC) to read in the page as a bitmap (a pattern of dots). Software
to recognize the images is also required. The character recognition software then
processes these scans to differentiate between images and text and to determine what
letters are represented in the light and dark areas. Older OCR systems match these
images against stored bitmaps based on specific fonts. The hit-or-miss results of such
pattern-recognition systems helped establish OCR's reputation for inaccuracy. Today's
OCR engines add the multiple algorithms of neural network technology.

SOLUTION APPROACH
On-line handwriting recognition involves the automatic conversion of text as it is written
on a special digitizer or PDA where a sensor picks up the pen-tip movements as well as
pen-up/pen-down switching. This kind of data is known as digital ink and can be regarded
as a digital representation of handwriting. The obtained signal is converted into letter
codes which are usable within computer and text-processing applications.
The elements of an on-line handwriting recognition interface typically include:
- a pen or stylus for the user to write with;
- a touch-sensitive surface, which may be integrated with, or adjacent to, an output
  display;
- a software application which interprets the movements of the stylus across the
  writing surface, translating the resulting strokes into digital text.


General process
The process of online handwriting recognition can be broken down into a few general
steps:
- preprocessing,
- feature extraction, and
- classification.

The purpose of preprocessing is to discard irrelevant information in the input data
that could negatively affect recognition; this concerns both speed and accuracy.
Preprocessing usually consists of binarization, normalization, sampling, smoothing and
denoising. The second step is feature extraction: out of the two- or more-dimensional
vector field received from the preprocessing algorithms, higher-dimensional data is
extracted. The purpose of this step is to highlight the information important for the
recognition model. This data may include information such as pen pressure, velocity,
or changes of writing direction. The last big step is classification, in which various
models are used to map the extracted features to different classes, thus identifying
the characters or words the features represent.

Hardware
Commercial products incorporating handwriting recognition as a replacement for
keyboard input were introduced in the early 1980s. Examples include handwriting
terminals such as the Pencept Penpad and the Inforite point-of-sale terminal. With the
advent of the large consumer market for personal computers, several commercial
products were introduced to replace the keyboard and mouse on a personal computer with
a single pointing/handwriting system, such as those from Pencept, CIC and others. The
first commercially available tablet-type portable computer was the GRiDPad from GRiD
Systems, released in September 1989. Its operating system was based on MS-DOS.
In the early 1990s, hardware makers including NCR, IBM and EO released tablet
computers running the PenPoint operating system developed by GO Corp. PenPoint used
handwriting recognition and gestures throughout and provided the facilities to third-party
software. IBM's tablet computer was the first to use the ThinkPad name and used IBM's
handwriting recognition. This recognition system was later ported to Microsoft Windows
for Pen Computing, and IBM's Pen for OS/2. None of these were commercially
successful.
Advancements in electronics allowed the computing power necessary for handwriting
recognition to fit into a smaller form factor than tablet computers, and handwriting
recognition is often used as an input method for hand-held PDAs. The first PDA to
provide written input was the Apple Newton, which exposed the public to the advantage
of a streamlined user interface.
However, the device was not a commercial success, owing to the unreliability of the
software, which tried to learn a user's writing patterns. By the time of the release of
the Newton OS 2.0, in which the handwriting recognition was greatly improved,
including unique features still not found in current recognition systems such as
modeless error correction, the largely negative first impression had already been
made. After the discontinuation of the Apple Newton, the feature was ported to Mac OS
X 10.2 and later in the form of Inkwell. Palm later launched a successful series of
PDAs based on the Graffiti recognition system. Graffiti improved usability by defining
a set of "unistrokes", or one-stroke forms, for each character. This narrowed the
possibility of erroneous input, although memorization of the stroke patterns did
increase the learning curve for the user.
The Graffiti handwriting recognition was found to infringe on a patent held by Xerox,
and Palm replaced Graffiti with a licensed version of the CIC handwriting recognition
which, while also supporting unistroke forms, pre-dated the Xerox patent. The court
finding of infringement was reversed on appeal, and then reversed again on a later
appeal; the parties involved subsequently negotiated a settlement concerning this and
other patents.

A Tablet PC is a special notebook computer that is outfitted with a digitizer tablet and a
stylus, and allows a user to handwrite text on the unit's screen. The operating system
recognizes the handwriting and converts it into typewritten text.
Windows Vista and Windows 7 include personalization features that learn a user's writing
patterns or vocabulary for English, Japanese, Chinese Traditional, Chinese Simplified and
Korean. The features include a "personalization wizard" that prompts for samples of a
user's handwriting and uses them to retrain the system for higher accuracy recognition.
This system is distinct from the less advanced handwriting recognition system employed
in its Windows Mobile OS for PDAs.
Although handwriting recognition is an input form that the public has become accustomed
to, it has not achieved widespread use in either desktop computers or laptops. It is still
generally accepted that keyboard input is both faster and more reliable. As of 2006, many
PDAs offer handwriting input, sometimes even accepting natural cursive handwriting, but
accuracy is still a problem, and some people still find even a simple on-screen keyboard
more efficient.

Software
Initial software modules could understand print handwriting where the characters were
separated. The author of the first applied pattern recognition program, in 1962, was
Shelia Guberman, then in Moscow. Commercial examples came from companies such as
Communications Intelligence Corporation and IBM. In the early 1990s, two companies,
ParaGraph International and Lexicus, came up with systems that could understand
cursive handwriting. ParaGraph was based in Russia and founded by computer scientist
Stepan Pachikov, while Lexicus was founded by Ronjon Nag and Chris Kortge, who were
students at Stanford University. The ParaGraph CalliGrapher system was deployed in the
Apple Newton systems, and the Lexicus Longhand system was made available commercially
for the PenPoint and Windows operating systems.

Lexicus was acquired by Motorola in 1993 and went on to develop Chinese handwriting
recognition and predictive text systems for Motorola. ParaGraph was acquired in 1997
by SGI and its handwriting recognition team formed a P&I division, later acquired from
SGI by Vadem. Microsoft acquired CalliGrapher handwriting recognition and other
digital ink technologies developed by P&I from Vadem in 1999. Wolfram Mathematica (8.0
or later) also provides a handwriting/text recognition function, TextRecognize.
The character recognition task has been attempted through many different approaches, such as
template matching and statistical techniques like NN, HMM and the Quadratic Discriminant
Function (QDF). Template matching works effectively for recognizing standard fonts, but
gives poor performance with handwritten characters and when the size of the dataset grows;
it is not an effective technique when fonts vary.
HMMs achieved great success in the field of speech recognition in past decades;
however, developing a 2-D HMM for character recognition has proved difficult and
complex, and NNs are computationally expensive at recognition time. N. Araki et
al. applied Bayesian filters based on Bayes' Theorem to handwritten character recognition.
Later, discriminative classifiers such as the Artificial Neural Network (ANN) and the Support
Vector Machine (SVM) attracted considerable attention.
G. Vamvakas et al. compared the performance of three classifiers: Naive Bayes, K-NN
and SVM, and attained the best performance with SVM. However, SVM suffers from the
limitation of kernel selection. ANNs can adapt to changes in the data and learn the
characteristics of the input signal; they also consume less storage and computation than
SVMs. The most widely used ANN-based classifiers are the MLP and the RBFN. B.K. Verma [10]
presented a system for HCR using MLP and RBFN networks in the task of handwritten
Hindi character recognition.
The error back-propagation algorithm was used to train the MLP networks. J. Sutha et al.
showed the effectiveness of MLP for Tamil HCR using Fourier descriptor features.
R. Gheroie et al. proposed handwritten Farsi character recognition using an MLP trained
with the error back-propagation algorithm.
Similar-shaped characters are difficult to differentiate because of the very minor
variations in their structures. T. Wakabayashi et al. proposed an F-Ratio (Fisher Ratio)
based feature extraction method to improve results on similar-shaped characters. They
considered pairs of similar-shaped characters from different scripts such as English,
Arabic/Persian and Devnagri, and used QDF for recognition. QDF suffers from the
limitation of a minimum required dataset size. F. Yang et al. proposed a method that
combines both structural and statistical features of characters for similar handwritten
Chinese character recognition.
As can be seen, researchers have used a variety of feature extraction methods and
classifiers suited to their particular work; here we propose a novel feature set that is
expected to perform well for this application. In this paper, features are extracted on
the basis of character geometry and then fed to each of the selected ML algorithms for
the recognition of SSHHC.

3. MACHINE LEARNING CONCEPTS


Machine learning [15] is a scientific discipline that deals with the design and development
of algorithms that allow computers to develop behaviours based on empirical data. In this
application, ML algorithms are used to map instances of the handwritten character
samples to predefined classes.

3.1. Machine Learning Algorithms


For this application of SSHHC recognition, we use the ML algorithms described below, as
implemented in WEKA 3.7.0. WEKA (Waikato Environment for Knowledge Analysis) is a
Java-based open-source machine learning workbench. These algorithms have been found to
perform very well in most applications and have been widely used by researchers. A brief
description of each algorithm follows:
3.1.1. Bayesian Network
A Bayesian Network [17], or Belief Network, is a probabilistic model in the form of a
directed acyclic graph (DAG) whose nodes represent a set of random variables and whose
edges represent their correlations. Bayesian Networks have the advantage that they visually
represent all the relationships between the variables in the system via connecting arcs.
They can also handle situations where the dataset is incomplete.
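To make the DAG factorization concrete, here is a minimal sketch (not taken from the report, and with hand-picked illustrative probabilities): a two-node network in which a hypothetical stroke-type variable is the parent of the character class, and the class marginal is obtained by the chain rule plus marginalization over the parent.

```python
# Hypothetical two-node Bayesian network: Stroke -> Class.
# All probability values below are invented for illustration only.
p_stroke = {"loop": 0.6, "bar": 0.4}             # P(Stroke)
p_class_given_stroke = {                          # P(Class | Stroke)
    "loop": {"ka": 0.7, "va": 0.3},
    "bar":  {"ka": 0.1, "va": 0.9},
}

def joint(stroke, cls):
    """P(Stroke=stroke, Class=cls) via the chain rule on the DAG."""
    return p_stroke[stroke] * p_class_given_stroke[stroke][cls]

def class_marginal(cls):
    """P(Class=cls), marginalizing over the parent node."""
    return sum(joint(s, cls) for s in p_stroke)
```

Each edge in the DAG corresponds to one conditional probability table; missing observations can be handled by summing over the unobserved variables exactly as `class_marginal` does.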
3.1.2. Radial Basis Function Network
An RBFN is an artificial neural network that uses radial basis functions as activation
functions. Due to their non-linear approximation properties, RBF Networks are able to
model complex mappings. RBF Networks do not suffer from the problem of local minima,
because the only parameters that need to be adjusted are the linear mappings from the
hidden layer to the output layer.
3.1.3. Multilayer Perceptron
An MLP is a feed-forward artificial neural network that computes a single output from
multiple real-valued inputs by forming a linear combination according to its input weights
and then passing the result through a nonlinear activation function (usually the sigmoid).
The MLP is a universal function approximator and is highly efficient at solving complex
problems due to the presence of one or more hidden layers.
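The forward computation just described (weighted linear combination, then sigmoid) can be sketched in a few lines of NumPy; the weights below are arbitrary toy values, not trained parameters from this work.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """One hidden layer: linear combination -> sigmoid -> linear -> sigmoid."""
    h = sigmoid(w_hidden @ x + b_hidden)   # hidden-layer activations
    return sigmoid(w_out @ h + b_out)      # network output in (0, 1)

# Toy 2-input, 3-hidden, 1-output network with fixed illustrative weights.
x = np.array([0.5, -1.0])
w_h = np.array([[0.1, 0.2], [-0.3, 0.4], [0.5, -0.6]])
b_h = np.zeros(3)
w_o = np.array([[0.7, -0.8, 0.9]])
b_o = np.zeros(1)
y = mlp_forward(x, w_h, b_h, w_o, b_o)
```

In training (e.g. by error back-propagation, as used later in this report), the weight matrices are adjusted to minimize the output error.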
3.1.4. C4.5

C4.5 is an extension of Ross Quinlan's earlier ID3 algorithm. It builds decision trees from
a set of training data using the concepts of information gain and entropy. C4.5 uses a
white-box model, which makes the explanation of its results easy to understand. It also
performs well even with large amounts of data.
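The two quantities C4.5 relies on, entropy and information gain, can be computed directly; this is a small self-contained sketch of those definitions, not of the full tree-building algorithm.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Entropy reduction from partitioning `labels` by an attribute.

    `attribute_values[i]` is the attribute value of example i; C4.5 picks
    the split with the highest gain (normalized by gain ratio)."""
    n = len(labels)
    groups = {}
    for lab, val in zip(labels, attribute_values):
        groups.setdefault(val, []).append(lab)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder
```

For a perfectly informative attribute the remainder is zero, so the gain equals the original entropy of the label set.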

3.2. Feature Reduction


Feature extraction is the task of detecting and isolating the desired attributes (features)
of an object in an image, so as to maximize the recognition rate with the least number of
elements. However, training the classifier with the maximum number of features obtained is
not always the best option, as irrelevant or redundant features can negatively impact a
classifier's performance and, at the same time, make the built classifier computationally
complex. Feature reduction, or feature selection, is the technique of selecting a subset of
relevant features to improve the performance of learning models by speeding up the
learning process and reducing computational complexity. The two feature reduction methods
chosen for this application are CFS [22] and CON [23], as both have been widely used by
researchers for feature reduction.
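As an illustration of the filter idea behind CFS (which scores feature subsets by correlation with the class, penalized by inter-feature correlation), the following simplified sketch ranks individual features by their absolute Pearson correlation with the class labels; it is an assumption-laden simplification, not the actual CFS or CON algorithm used in WEKA.

```python
import numpy as np

def correlation_ranking(X, y, k):
    """Simplified correlation-based filter (NOT full CFS): rank features
    by |Pearson correlation with the class| and keep the top k indices."""
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return sorted(np.argsort(scores)[::-1][:k].tolist())

# Feature 0 matches the class labels exactly; feature 1 is uncorrelated.
y = np.array([0, 1, 0, 1, 0, 1, 1, 0], dtype=float)
X = np.stack([y, np.array([0, 0, 1, 1, 0, 0, 1, 1], dtype=float)], axis=1)
selected = correlation_ranking(X, y, k=1)
```

Full CFS additionally discounts subsets whose features are correlated with each other, so that redundant features are not all selected together.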
4. EXPERIMENTAL METHODOLOGY
The following sections describe the data-set, pre-processing and feature extraction adopted
in our proposed work of recognition of SSHHC.
4.1. Dataset Creation
The dataset is created by asking candidates of different age groups to write the
similar-shaped characters (, ; , ; , ) several times in their own handwriting on plain
white sheets. These image samples are scanned using an HP Scanjet G2410 at a resolution
of 1200 x 1200 dpi. Each character is cropped and stored in .jpg format using MS Paint.
The resulting dataset thus consists of isolated handwritten Hindi characters that are to
be recognized using ML algorithms.
From these character samples, three datasets are created as described below:
Dataset 1 consists of only 100 samples of the target pair (, ).
Dataset 2 consists of an increased number of samples of the same target pair: the training
dataset is enlarged to 342 samples by adding samples of the target class from other
writers. More samples are added in order to analyze the impact of the number of samples
on the relative performance of the ML algorithms.

Dataset 3 consists of samples of both the target and non-target classes, i.e. other
similar-shaped character pairs (like , ; , ) are also added to the dataset (making 500
samples in total) on which the ML algorithms are trained. Non-target class characters are
added to test the ability of the ML classifiers to pick out the target characters from
among different characters. A few samples of the entire dataset are shown in Figure 1.
4.2. Performance Metrics
Performance of the classifiers is evaluated on the basis of the metrics described below:
i Precision: Proportion of the examples which truly have class x among all those which
were classified as class x.
Figure 1: Samples of Handwritten Hindi Characters
ii Misclassification Rate: Number of instances that were classified incorrectly out of
the total instances.
iii Model Build Time: Time taken to train a classifier on a given data set.
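The first two metrics can be computed directly from the true and predicted labels; a short sketch (label values here are arbitrary placeholders):

```python
def precision(y_true, y_pred, cls):
    """Fraction of examples predicted as `cls` that truly are `cls`."""
    predicted = [t for t, p in zip(y_true, y_pred) if p == cls]
    return sum(t == cls for t in predicted) / len(predicted)

def misclassification_rate(y_true, y_pred):
    """Fraction of all instances classified incorrectly."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy labels: two of four predictions are wrong.
y_true = ["ka", "ka", "va", "va"]
y_pred = ["ka", "va", "va", "ka"]
p_ka = precision(y_true, y_pred, "ka")
err = misclassification_rate(y_true, y_pred)
```

Model build time, the third metric, is simply measured as wall-clock training time and is reported by WEKA directly.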
4.3. Pre-processing
The following pre-processing steps are applied to the scanned character images:
i First, each RGB character image is converted to grayscale and binarized through
thresholding.
ii The image is inverted so that the background is black and the foreground is white.
iii The shortest matrix that fits the entire character skeleton is then obtained for each
image; this is termed the universe of discourse.
iv Finally, spurious pixels are removed from the image, followed by skeletonization.
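Steps i-iii above can be sketched with NumPy as follows; the threshold value is an illustrative assumption, and spur removal and skeletonization (step iv) are omitted since they are normally delegated to an image-processing library.

```python
import numpy as np

def preprocess(gray, threshold=128):
    """Binarize a grayscale image, invert so the character is white (1),
    and crop to the smallest matrix containing the character (the
    'universe of discourse'). Skeletonization is not shown here."""
    binary = (gray >= threshold).astype(np.uint8)   # dark ink pixels -> 0
    inverted = 1 - binary                           # ink becomes foreground 1
    rows = np.any(inverted, axis=1)
    cols = np.any(inverted, axis=0)
    r0, r1 = np.where(rows)[0][[0, -1]]
    c0, c1 = np.where(cols)[0][[0, -1]]
    return inverted[r0:r1 + 1, c0:c1 + 1]

# A white 5x5 'scan' with a 2x2 dark blob standing in for the character.
gray = np.full((5, 5), 255)
gray[1:3, 2:4] = 0
character = preprocess(gray)
```

The cropped output contains only the character's bounding region, which keeps the later zonal features independent of where the character sat on the page.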
4.4. Feature Extraction
After pre-processing, features are extracted for each character image based on character
geometry, using the technique described in [24]. The features are based on the basic line
types that form the skeleton of the character.
Every pixel in the image is traversed. Individual line segments, their directions and
intersection points are identified from the isolated character image. To do this, the image
matrix is first divided into nine zones, and the number, length and type of lines and
intersections present in each zone are determined. The line types are: horizontal,
vertical, right diagonal and left diagonal. For each zone, the following features are
extracted, resulting in a feature vector of length 9 per zone:
i. Number of horizontal lines
ii. Number of vertical lines
iii. Number of right diagonal lines
iv. Number of left diagonal lines
v. Normalized length of all horizontal lines
vi. Normalized length of all vertical lines
vii. Normalized length of all right diagonal lines
viii. Normalized length of all left diagonal lines
ix. Number of intersection points
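The zoning step can be sketched as below. Note this is a simplification: it only divides the image into the 3 x 3 grid and counts foreground pixels per zone, whereas the actual method classifies line segments by type and counts intersections in each zone.

```python
import numpy as np

def zones(img, rows=3, cols=3):
    """Split a (preferably skeletonized) binary image into a rows x cols
    grid of zones, as in the nine-zone scheme described above."""
    h, w = img.shape
    rs = np.linspace(0, h, rows + 1, dtype=int)
    cs = np.linspace(0, w, cols + 1, dtype=int)
    return [img[rs[i]:rs[i + 1], cs[j]:cs[j + 1]]
            for i in range(rows) for j in range(cols)]

def zone_pixel_counts(img):
    """Stand-in for the 9-value zone descriptor: count foreground pixels
    per zone (the full method extracts line types and intersections)."""
    return [int(z.sum()) for z in zones(img)]

# A fully-white 6x6 skeleton: each of the 9 zones is a 2x2 block.
skeleton = np.ones((6, 6), dtype=np.uint8)
counts = zone_pixel_counts(skeleton)
```

Concatenating the nine per-zone descriptors yields the 81-element zonal feature vector described next.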
A total of 81 (9 x 9) zonal features are thus obtained. After zonal feature extraction,
four additional features are extracted for the entire image based on its regional
properties, namely:
i. Euler Number: the number of objects in the image minus the number of holes
ii. Eccentricity: the ratio of the distance between the foci of the ellipse to its major
axis length
iii. Orientation: the angle between the x-axis and the major axis of the ellipse that has
the same second moments as the region
iv. Extent: the ratio of pixels in the region to the pixels in the total bounding box
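Two of these regional properties, extent and orientation, are easy to compute directly from pixel coordinates; the following is an illustrative sketch (in practice a library such as MATLAB's regionprops computes all four).

```python
import numpy as np

def extent(binary):
    """Extent: foreground pixel count / bounding-box pixel count."""
    ys, xs = np.nonzero(binary)
    bbox = (ys.max() - ys.min() + 1) * (xs.max() - xs.min() + 1)
    return float(binary.sum()) / bbox

def orientation(binary):
    """Angle (radians, from the x-axis) of the major axis of the ellipse
    having the same second moments as the region."""
    ys, xs = np.nonzero(binary)
    x = xs - xs.mean()
    y = ys - ys.mean()
    mu11, mu20, mu02 = (x * y).mean(), (x * x).mean(), (y * y).mean()
    return 0.5 * np.arctan2(2 * mu11, mu20 - mu02)

square = np.ones((4, 4), dtype=np.uint8)   # completely fills its bounding box
bar = np.zeros((5, 5), dtype=np.uint8)
bar[2, :] = 1                              # a horizontal stroke
```

A solid square has extent 1.0, and a horizontal stroke has orientation 0, matching the definitions above.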

5. IMPLEMENTATION AND RESULTS


5.4 Experimental Results
5.4.1 Test application Analysis
The test application accompanying the source code can perform recognition of handwritten
digits. To do so, open the application (preferably outside Visual Studio, for better
performance), click on the File menu and select Open. This loads some entries from the
Optdigits dataset into the application.

To perform the analysis, click the Run Analysis button. Please be aware that this may take
some time. After the analysis is complete, the other tabs in the sample application are
populated with the analysis information; the contribution of each factor found during the
discriminant analysis is plotted in a pie graph for easy visual inspection. Once the
analysis is complete, we can test its classification ability on the testing dataset.
The green rows have been correctly identified by the discriminant-space Euclidean distance
classifier. We can see that it correctly identifies 98% of the testing data. The testing
and training datasets are disjoint and independent.

Fig.6: Using the default values in the application

Results
After the analysis has been completed and validated, we can use it to classify new digits
drawn directly in the application. The bars on the right show the relative response of
each of the discriminant functions. Each class has a discriminant function that outputs a
closeness measure for the input point; the classification is based on which function
produces the maximum output.
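That decision rule (pick the class whose discriminant function responds most strongly) is a one-liner; in this sketch the discriminants are hypothetical stand-ins based on negative distance to an assumed per-class centre, so that the closest centre gives the maximum output.

```python
def classify(x, discriminants):
    """Assign x to the class whose discriminant function is maximal."""
    return max(discriminants, key=lambda cls: discriminants[cls](x))

# Illustrative discriminants: negative distance to a per-class centre
# (the centres below are invented for demonstration).
centres = {"class_a": 0.0, "class_b": 10.0}
discriminants = {c: (lambda x, m=m: -abs(x - m)) for c, m in centres.items()}
label = classify(2.0, discriminants)
```

The same rule applies unchanged when the discriminant is a Euclidean distance measured in the projected discriminant space, as in the application described above.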
Handwritten Devanagari character sets are taken from a test .bmp image. The following
steps are followed to obtain the best accuracy for an input handwritten Hindi character
image given to the system. First, the system is trained using different datasets or
samples. The system is then tested on a few of the given samples, and its accuracy is
measured. For each character, features were computed and stored in templates for training
the system.
Sets of handwritten Gurumukhi characters were also made. This dataset was partitioned into
two parts: the first used for training the system and the second for testing. For each
character, features were computed and stored for training the network. Three network
layers are taken: one input layer, one hidden layer and one output layer. If the number of
neurons in the hidden layer is increased, a memory allocation problem occurs. Also, if the
error tolerance is high, say 0.1, the desired results are not obtained; lowering the error
tolerance to, say, 0.01 yields a high accuracy rate. The network also takes more cycles to
learn when the error tolerance is low; with a high error tolerance the network learns in
fewer cycles, but the learning is not very fine. A unit disk is taken for each character
by finding the maximum radius of the character (i.e. the maximum distance between the
centre of the character and its boundary), so that the character fits on the disk.
Below are some tables displaying the results obtained from the program. Images of the same
letter are grouped together in each table. The tables give information about the
pre-processing operations that took place (i.e. noise removal, edge detection, gap filling)
and also whether the image belongs to the same database as the training images. The amount
of each filter is also recorded, so the maximum level of noise that the network can
tolerate can be estimated. This of course varies from character image to character image.
The results also vary slightly each time the algorithm is executed; the variance is very
small, but it is there. Following are the main results of Gurumukhi character recognition:
Table 4: Recognition Accuracy of Handwritten Hindi Characters

Character | No. of Samples | Train/Test | % Accuracy
          | 200            | 180/20     | 93%
          | 196            | 176/20     | 87%
          | 155            | 130/25     | 89%
          | 184            | 169/15     | 71%
          | 192            | 162/30     | 69%
          | 160            | 140/20     | 81%
          | 179            | 159/20     | 79%
          | 168            | 148/20     | 84%
          | 195            | 170/25     | 80%
          | 177            | 152/25     | 90%
          | 191            | 166/25     | 88%
          | 180            | 165/15     | 86%
          | 195            | 170/25     | 89%
          | 187            | 167/20     | 96%
          | 169            | 149/20     | 95%
          | 199            | 174/25     | 92%
          | 188            | 168/20     | 94%
          | 166            | 146/20     | 82%
          | 196            | 176/20     | 82%
          | 189            | 164/25     | 88%
          | 168            | 148/20     | 85%
          | 178            | 158/20     | 84%
          | 196            | 176/20     | 87%
          | 171            | 151/20     | 81%
          | 182            | 162/20     | 88%
          | 184            | 164/20     | 80%
          | 169            | 149/20     | 89%
          | 180            | 155/25     | 76%
          | 170            | 150/20     | 78%
          | 193            | 173/20     | 71%
          | 185            | 165/20     | 82%
          | 176            | 146/30     | 70%
          | 167            | 147/20     | 92%
          | 157            | 132/25     | 85%
          | 178            | 158/20     | 87%
          | 183            | 153/30     | 69%
          | 191            | 161/30     | 73%
          | 185            | 155/30     | 70%

Fig 7: The analysis also performs rather well on completely new and previously unseen
data.

Experiments were also performed on different samples of numerals from mixed scripts,
using a single hidden layer.

Table 2: Detail Recognition performance of SVM on UCI datasets

Table 3: Detail Recognition performance of SVM and HMM on UCI datasets

Table 4: Recognition Rate of Each Numeral in DATASET

It is observed that the recognition rate using SVM is higher than with the Hidden Markov
Model. However, the free-parameter storage for the SVM model is significantly higher: the
memory required for SVM is the number of support vectors multiplied by the number of
feature values (in this case 350). This is significantly large compared to the HMM, which
only needs to store the weights; the HMM needs less space due to its weight-sharing
scheme. In SVM, however, space can be saved by storing only the original online signals
and the pen-up/pen-down status in a compact manner; during recognition, the model is
expanded dynamically as required. Table 3 shows the comparison of recognition rates
between HMM and SVM using all three databases. SVM clearly outperforms HMM in all three
isolated-character cases.
The results for the isolated-character cases above indicate that the recognition rate of
the hybrid word recognizer could be improved by using SVM instead of HMM. We are therefore
currently implementing a word recognizer using both HMM and SVM and comparing their
performance.

FUTURE WORK AND CONCLUSION


CONCLUSION - An important feature of this ANN training is that the learning rates are
dynamically computed each epoch by an interpolation map. The ANN error function is
transformed into a lower-dimensional error space, and the reduced error function is
employed to identify the variable learning rates. As training progresses, the geometry
of the ANN error function constantly changes, and therefore the interpolation map always
identifies variable learning rates that gradually reduce to a lower magnitude.
As a result, the error function also reduces to a smaller terminal value. The structure
analysis shows that as the number of hidden nodes increases, the number of epochs taken to
recognize a handwritten character also increases. A lot of effort has been made to achieve
higher accuracy, but there is still tremendous scope for improving recognition accuracy.
6. Conclusion
Handwriting recognition is a challenging field in machine learning, and this work
identifies Support Vector Machines as a potential solution. The number of support vectors
can be reduced by selecting better C and gamma parameter values through a finer grid
search and by reduced-set selection. Work on integrating the SVM character recognition
framework into the HMM-based word recognition framework is under way. In the hybrid
system, word pre-processing and normalization need to be done before SVM is used for
character hypothesis recognition and word likelihood computation using HMM. It is
envisaged that, due to SVM's better discrimination capability, the word recognition rate
will be better than in an HMM hybrid system.
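The C/gamma grid search mentioned above can be sketched as follows. This is not the hybrid system described in the text: it is a generic illustration using scikit-learn (an assumed tool, not one used in the original work) on its bundled handwritten-digits dataset as a stand-in.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Handwritten digits as a stand-in corpus for the character data.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Coarse grid over C and gamma; in practice a finer grid around the best
# cell further reduces the support-vector count and improves accuracy.
param_grid = {"C": [1, 10, 100], "gamma": [1e-4, 1e-3, 1e-2]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
search.fit(X_train, y_train)
best = search.best_estimator_
```

`search.best_params_` then gives the selected (C, gamma) cell, and `best` is the RBF-kernel SVM refit on the full training split with those values.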
The scope of the proposed system is limited to the recognition of a single character.
Offline handwritten Hindi character recognition is a difficult problem, not only because
of the great variation in human handwriting, but also because of overlapped and joined
characters. Recognition approaches depend heavily on the nature of the data to be
recognized. Since handwritten Hindi characters can be of various shapes and sizes, the
recognition process needs to be highly efficient and accurate to recognize characters
written by different users. There are several reasons that make Hindi handwritten
character recognition problematic. Some characters are similar in shape (for example and ).

Sometimes characters are overlapped and joined. A large number of character and stroke
classes are present. Different users, or even the same user, can write differently at
different times, depending on the pen or pencil, the width of the line, the slight rotation
of the paper, the type of paper, and the mood and stress level of the person. A character
can be written at different locations on the paper or in the window, and characters can be
written in different fonts.

Handwritten Gurumukhi character recognition using neural networks is discussed here. It
has been found that recognition of handwritten Gurumukhi characters is a very difficult
task. The main reasons for this difficulty are: some Gurumukhi characters are similar in
shape (for example and ); different writers, or even the same writer, can write differently
at different times, depending on the pen or pencil, the width of the line, the slight
rotation of the paper, the type of paper and the mood and stress level of the person; a
character can be written at a different location on the paper or in the window; and
characters can be written in different fonts.
These facts are borne out by the work done here. A small set of Gurumukhi characters was
trained using a back-propagation neural network, and testing was then performed on another
character set; the accuracy of the network was very low. Some other character images were
then added to the old character set and the network was retrained on the new sets. Testing
again on new image sets written by different people showed that the accuracy of the network
increased slightly in some cases. Further new character images were then added to the old
training set and the network retrained on this new set; when presented with new character
images, recognition again improved, although at a slow rate. The results of the final
training with a 50-character set and testing with a 10-character set are presented. It can
be concluded that as the network is trained with more sets, the recognition accuracy of
the characters will definitely increase.

Future scope
Over the past three decades, many different methods have been explored by a large number
of scientists to recognize characters. A variety of approaches have been proposed and
tested by researchers in different parts of the world, including statistical methods,
structural and syntactic methods, and neural networks. No OCR in the world is 100%
accurate to date. The recognition accuracy of the neural networks proposed here can be
further improved: the number of character sets used for training is quite low, and the
accuracy of the network can be increased by using more training character sets. This
recognition approach is used for Gurumukhi characters only; in future work, it can be
extended to the recognition of Gurumukhi words.

REFERENCES
[1] R. Plamondon and S. N. Srihari, "On-line and off-line handwriting recognition: a
comprehensive survey", IEEE Transactions on PAMI, Vol. 22(1), pp. 63-84, 2000.
[2] A. Negi, C. Bhagvati and B. Krishna, "An OCR system for Telugu", in Proceedings of
the Sixth International Conference on Document Processing, pp. 1110-1114, 2001.
[3] J. I. Hong and J. A. Landay, "SATIN: a toolkit for informal ink-based applications",
CHI Letters: ACM Symposium on UIST, 2(2), pp. 63-72.
[4] S. Mori, C. Y. Suen and K. Yamamoto, "Historical review of OCR research and
development", Proceedings of the IEEE, Vol. 80(7), pp. 1029-1058, 1992.
[5] U. Pal and B. B. Chaudhuri, "Indian script character recognition", Pattern
Recognition, Vol. 37(9), pp. 1887-1899, 2004.
[6] H. Bunke and P. S. P. Wang, Handbook of Character Recognition and Document Image
Analysis, World Scientific Publishing Company, 1997.
[7] S. V. Rice, G. Nagy and T. A. Nartker, Optical Character Recognition: An Illustrated
Guide to the Frontier, Kluwer Academic Publications, 1999.
[8] S. Mori, H. Nishida and H. Yamada, Optical Character Recognition, John Wiley & Sons,
1999.
[9] S. Impedovo, L. Ottaviano and S. Occhinegro, "Optical character recognition",
International Journal of Pattern Recognition and Artificial Intelligence, Vol. 5(1-2),
pp. 1-24, 1991.
[10] A. K. Jain, R. P. W. Duin and J. Mao, "Statistical pattern recognition: a review",
IEEE Transactions on PAMI, Vol. 22(1), pp. 4-37, 2000.
[11] M. Kumar, "Degraded text recognition of Gurmukhi script", DSpace, Thapar University,
Patiala.
[12] R. Kasturi and L. O'Gorman, "Document image analysis: a bibliography", Machine
Vision and Applications, Vol. 5(3), pp. 231-243, 1992.
[13] C. C. Tappert, C. Y. Suen and T. Wakahara, "The state of the art in on-line
handwriting recognition", IEEE Transactions on PAMI, Vol. 12(8), pp. 787-808, 1990.
[14] T. Wakahara, H. Murase and K. Odaka, "On-line handwriting recognition", Proceedings
of the IEEE, Vol. 80(7), pp. 1181-1194, 1992.
[15] F. Nouboud and R. Plamondon, "On-line recognition of handprinted characters: beta
tests", Pattern Recognition, Vol. 23(9), pp. 1031-1044, 1990.
[16] C. Y. Suen, M. Berthod and S. Mori, "Automatic recognition of handprinted
characters: the state of the art", Proceedings of the IEEE, Vol. 68(4), pp. 469-487, 1980.
[17] S. D. Connell, R. M. K. Sinha and A. K. Jain, "Recognition of unconstrained on-line
Devanagari characters", in Proceedings of the 15th International Conference on Pattern
Recognition (ICPR), Vol. 2, Spain, pp. 368-371, 2000.
[18] S. D. Connell and A. K. Jain, "Template-based online character recognition", Pattern
Recognition, Vol. 34(1), pp. 1-14, 2001.
[19] F. Bortolozzi, A. Britto Jr., L. S. Oliveira and M. Morita, "Recent advances in
handwriting recognition", in Proceedings of the International Workshop on Document
Analysis (IWDA), India, pp. 1-30, 2005.
[20] S. W. Lee, "Off-line recognition of totally unconstrained handwritten numerals using
multilayer cluster neural network", IEEE Transactions on PAMI, Vol. 18(6), pp. 648-652,
1996.
[21] F. El-Khaly and M. A. Sid-Ahmed, "Machine recognition of optically captured machine
printed Arabic text", Pattern Recognition, Vol. 23(11), pp. 1207-1214, 1990.
[22] A. Amin, "Off-line Arabic character recognition", in Proceedings of ICDAR,
pp. 596-599, 1997.
[23] L. M. Lorigo and V. Govindaraju, "Offline Arabic handwriting recognition", IEEE
Transactions on PAMI, Vol. 28(5), pp. 712-724, 2006.
[24] T. H. Hildebrandt and W. Liu, "Optical recognition of handwritten Chinese
characters: advances since 1980", Pattern Recognition, Vol. 26(2), pp. 205-225, 1993.
[25] C. L. Liu, S. Jaeger and M. Nakagawa, "Online recognition of Chinese characters: the
state-of-the-art", IEEE Transactions on PAMI, Vol. 26(2), pp. 198-213, 2004.
[26] I. K. Sethi and B. Chatterjee, "Machine recognition of constrained hand-printed
Devnagari", Pattern Recognition, Vol. 9, pp. 69-75, 1977.
[27] R. M. K. Sinha and H. Mahabala, "Machine recognition of Devnagari script", IEEE
Transactions on Systems, Man and Cybernetics, Vol. 9, pp. 435-441, 1979.