REPORT
ON
OFFLINE HANDWRITTEN HINDI CHARACTER RECOGNITION USING DATA
MINING
SUBMITTED BY
SHRIKRISHNA SHARMA
Roll No.: 1350204016
CERTIFICATE
This is to certify that Mr./Ms. SHRIKRISHNA SHARMA (Roll No. 1350204016)
has carried out the Seminar/Colloquium work presented in this report, entitled
OFFLINE HANDWRITTEN HINDI CHARACTER RECOGNITION, towards the
Master of Computer degree, under the supervision of the undersigned.
Ashok Kumar
HOD (IICA)

Ajay Indian
Associate Professor

Invertis University, Bareilly
ACKNOWLEDGEMENT
The real spirit of achieving a goal lies in the way of excellence and austere
discipline. I would never have succeeded in completing my task without the
cooperation, encouragement and help provided to me by various people.
THANK YOU
SHRIKRISHNA SHARMA
1350204016
ABSTRACT
Character recognition finds application in commercial forms, bill-processing systems,
bank cheques, government records, signature verification, postcode recognition, passport
readers, and offline recognition of the documents generated by an expanding technological
society. In this project, Devnagari-script characters are recognized from document images
using a template-matching algorithm.
Table of Contents
1. Cover Page
2. Certificate
3. Abstract
4. Acknowledgements
5. Table of Contents
6. List of Tables
7. List of Figures
8. Introduction
9. Review of related work and problem statement
10. Proposed approach
11. Solution approach
12. Implementation and Result
13. Future work and conclusion
14. References
List of Tables
List of Figures
1. HINDI LANGUAGE BASIC CHARACTER SET
2. CHARACTER RECOGNITION OF THE DOCUMENT IMAGE
3. OUTPUT IS SAVED IN FORM OF THE TEXT FORMAT
4. GENERATED 8×8 INPUT MATRICES
INTRODUCTION
Handwriting recognition refers to the process of translating images of handwritten,
typewritten, or printed digits into a format understood by the user, for the purposes of
editing, indexing/searching, and a reduction in storage size. Handwriting recognition
systems have their own importance and are applicable in various fields, such as online
handwriting recognition on computer tablets, recognizing zip codes on mail for postal
sorting, processing bank-cheque amounts, and reading numeric entries in forms filled in
by hand. There are two distinct handwriting recognition domains, online and offline,
which are differentiated by the nature of their input signals.
The need for character recognition software has increased greatly since the
outstanding growth of the Internet. Optical Character Recognition (OCR) is a very
well-studied problem in the vast area of pattern recognition. Its origins can be found as
early as 1870, when an image-transmission system was invented that used an array of
photocells to recognize patterns.
Until the middle of the 20th century OCR was primarily developed as an aid to the
visually handicapped. With the advent of digital computers in the 1940s, OCR was
realized as a data processing approach for the first time. The first commercial OCR
systems began to appear in the early 1950s and soon they were being used by the US
postal service to sort mail. The accurate recognition of Latin-script, typewritten text is
now considered largely a solved problem on applications where clear imaging is available
such as scanning of printed documents.
Typical accuracy rates on these exceed 99%; total accuracy can only be achieved by
human review. Optical Character Recognition (OCR) programs are capable of reading
printed text. This could be text that was scanned in from a document, or handwritten text
that was drawn on a hand-held device, such as a Personal Digital Assistant (PDA). The
character recognition software breaks the image into sub-images, each containing a single
character.
The sub-images are then translated from an image format into a binary format, where each
0 and 1 represents an individual pixel of the sub image. The binary data is then fed into a
neural network that has been trained to make the association between the character image
data and a numeric value that corresponds to the character. The output from the neural
network is then translated into ASCII text and saved as a file. Recognition of characters is
a very complex problem: characters can be written in different sizes, orientations,
thicknesses, formats and dimensions, which gives rise to infinite variations.
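The digitization of sub-images into 0/1 pixel data described above can be sketched as follows. This is an illustrative NumPy sketch (the report's own implementation uses MATLAB and C#); the threshold value and the toy sub-image are arbitrary.

```python
import numpy as np

def digitize(subimage, threshold=128):
    """Convert a grayscale sub-image (0-255) into a binary 0/1 matrix:
    0 for ink (black), 1 for background (white), as described above."""
    return (subimage >= threshold).astype(np.uint8)

# A toy 4x4 grayscale sub-image: a dark stroke on a light background.
sub = np.array([[250, 30, 20, 240],
                [245, 25, 15, 235],
                [250, 20, 10, 245],
                [255, 35, 25, 250]])

binary = digitize(sub)        # 0 = black pixel, 1 = white pixel
features = binary.flatten()   # 16-element vector fed to the network
```

The flattened vector is what a neural network would receive as input, one element per pixel.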
The capability of a neural network to generalize and its insensitivity to missing data
[6, 7] are very beneficial in recognizing characters, so an artificial neural network is
used as the backend to solve the recognition problem. Neural networks have been used in a
variety of different areas to solve a wide range of problems. Unlike human brains, which
can identify and memorize characters such as letters or digits, computers treat them as
binary graphics. The central objective of this paper is to demonstrate the capabilities of
artificial neural network implementations in recognizing extended sets of image pixel
data. In this paper, offline recognition of characters is performed on printed text
documents. It is a process by which a printed or scanned page is converted to ASCII
characters that a computer can recognize. A back-propagation feed-forward neural network
is used to recognize the characters.
After training the network with the back-propagation learning algorithm, high recognition
accuracy can be achieved. Recognition of printed characters is itself a challenging
problem, since the same character varies with changes of font or the introduction of
different types of noise. Differences in fonts and sizes make the recognition task
difficult if pre-processing, feature extraction and recognition are not robust. In the
remainder of this paper, a multilayer perceptron neural network is used for recognition.

Literature Survey
Although the first research report on handwritten Devnagari characters was published in 1977
[1], not much research work was done for some time after that. At present, researchers
have started to work on handwritten Devnagari characters, and a few research reports have
been published recently. In this paper, the implementation is done in MATLAB, which
allows matrix manipulations, plotting of functions and data, implementation of
algorithms, creation of user interfaces, and interfacing with programs written in other
languages, including C, C++, Java, and Fortran. Hanmandlu and Murthy [2][3] proposed a
fuzzy-model-based recognition of
handwritten Hindi numerals and characters and they obtained 92.67% accuracy for
Handwritten Devnagari numerals and 90.65% accuracy for Handwritten Devnagari
characters. Bajaj et al. [4] employed three different kinds of features, namely density
features, moment features and descriptive component features for classification of
Devnagari Numerals. They proposed multi-classifier connectionist architecture for
increasing the recognition reliability and they obtained 89.6% accuracy for handwritten
Devnagari numerals. Kumar and Singh [5] proposed a Zernike moment feature based
approach for Devnagari handwritten character recognition. They used an artificial neural
network for classification.
OCR is one of the oldest ideas in the history of pattern recognition using computers. In
recent times, Punjabi character recognition has become a field of practical use. In
character recognition, the process starts with reading a scanned image of a series of
characters, determining their meaning, and finally translating the image into a
computer-written text document. This process is commonly used in post offices to
mechanically read the names and addresses on envelopes, and by banks to read the amount
and number on cheques. Companies and private citizens can also use this method to quickly
translate paper documents into computer-written documents. Much research has been done on
character recognition in the last 56 years. Some books [6-8] and many surveys [4, 5] have
been published on the character recognition. Most of the work on character recognition
has been done on Japanese, Latin, Chinese characters in the middle of 1960s. The work by
Impedovo et al. [9] focuses on commercial OCR systems. Jain et al. [10] summarized and
compared some of the well-known methods used in various stages of a pattern recognition
system. They have tried to identify research topics and applications, which are at the
forefront in this field. Pal and Chaudhuri [8] in their report summarized different systems
for Indian language scripts recognition.
They have described some commercial systems like Bangla and Devnagari OCRs. Manish
[11] in his survey report summarized a system for the recognition of Punjabi characters.
He reported the scope of future work to be extended in several directions, such as OCR
for poor-quality documents, multi-font OCR, and bi-script/multi-script OCR
development. A bibliography of the fields of OCR and document analysis is given in
[12]. Tappert et al. [13] and Wakahara et al. [14] worked on on-line handwriting
recognition and described a distortion-tolerant shape matching method. Nouboud and Plamondon
[15] and Suen et al. [16] proposed methods used for on-line recognition of hand-printed
characters while Connell et al. [17, 18] described on-line character recognition for
Devanagari characters and alphanumeric characters. Bortolozzi et al. [19] have published
a very useful study on recent advances in handwriting recognition. Lee et al. [20]
described off-line recognition of totally unconstrained handwritten numerals using a
multilayer cluster neural network. The character regions are determined by using
projection profiles and topographic features extracted from the gray-scale images. Then, a
nonlinear character segmentation path in each character region is found by using multistage graph search algorithm. Khaly and Ahmed [21], Amin [22] and Lorigo & Govindraju
[23] have produced a bibliography of research on the Arabic optical text recognition.
Hildebrandt and Liu [24] have reported the advances in handwritten Chinese character
recognition and Liu et al. [25] have discussed various techniques used for on-line Chinese
character recognition.
2.1 Indian Script Recognition
As compared to the English and Chinese languages, research on OCR of Indian-language
scripts has not reached the same level of maturity. A few attempts have been made on
handwritten characters using neural networks. In one such method, primitives are used to
represent the characters, with structural constraints between the primitives imposed by
the junctions present in the characters. A neural network approach is also used by
Bhattacharya et al. [33] for the recognition of Bangla handwritten numeral. In this, certain
features like loops, junctions, etc. present in the graph are considered to classify a numeral
into a smaller group. Sural and Das [35] defined fuzzy sets on the Hough transform of
character-pattern pixels, from which additional fuzzy sets are synthesized using t-norms.
Garain et al. [36] proposed an online handwriting recognition system for Bangla. A low
complexity classifier has been designed and the proposed similarity measure appears to be
quite robust against wide variations in writing styles.
Pal, Wakabayashi and Kimura [37] proposed a recognition system for handwritten
offline compound Bangla characters using a Modified Quadratic Discriminant Function
(MQDF). The features used for recognition are mainly based on directional
information obtained from the arc tangent of the gradient. To get the features, a 2×2
mean filter is first applied 4 times to the gray-level image, and non-linear size
normalization is done on the image.
2.1.3 Recognition of Tamil Characters
The work on recognition of Tamil characters started in 1978 by Siromony et al. [38]. They
described a method for recognition of machine-printed Tamil characters using an encoded
character string dictionary. The scheme employs string features extracted by row- and
column- wise scanning of character matrix. Features in each row (column) are encoded
suitably depending upon the complexity of the script to be recognised. Chandrasekaran et
al. [39] used similar approach for constrained hand-printed Tamil character recognition.
Chinnuswamy and Krishnamoorthy [40] presented an approach for hand-printed Tamil
character recognition employing labeled graphs to describe structural composition of
characters in terms of line-like primitives. Recognition is carried out by correlation
matching of the labeled graph of the unknown character with that of the prototypes.
A piece of work on on-line Tamil character recognition is reported by Aparna et al. [41].
They used shape-based features including dots, line terminals, bumps and cusps. Stroke
identification is done by comparing an unknown stroke with a database of strokes.
Finite-state automata have been used for character recognition, with an accuracy of
71.32-91.5%.
2.1.4 Recognition of Telugu Characters
A two-stage recognition system for printed Telugu alphabets has been described by
Rajasekaran and Deekshatulu [42]. In the first stage a directed curve tracing method is
employed to recognize primitives and to extract basic character from the actual character
pattern. In the second stage, the basic character is coded, and on the basis of the
knowledge of the primitives and the basic character present in the input pattern, the
classification is achieved by means of a decision tree. Lakshmi and Patvardhan [43]
presented a Telugu OCR system for printed text of multiple sizes and multiple fonts.
HINDI LANGUAGE: A REVIEW
Hindi is an Indo-Aryan language and is one of the official languages of India. It is the
world's third most commonly used language after Chinese and English and has
approximately 500 million speakers all over the world. It is written in the Devnagari
script, from left to right along a horizontal line. The basic character set has 13
SWARS (vowels) and 33 VYANJANS (consonants), shown in the figure.
DEVNAGARI SCRIPT
Hindi is the world's third most commonly used language after Chinese and English, and
approximately 500 million people all over the world speak and write it. Devnagari is
the basic script of many languages of India, such as Hindi and Sanskrit. It is
indisputable that Devnagari has an accurate scientific basis. For a long time it has
been the script of the Indian Aryan languages, and it is still used by the Sanskrit,
Hindi, Marathi and Nepali languages. Hindi is a widely spoken language, and since its
script is Devnagari, Devnagari is correspondingly popular. As Hindi has been declared
the national language by the Constitution of India, Devnagari has got the status of the
national script.
Hindi has also been declared the state language, and Devnagari the state script, of
major states such as Himachal Pradesh, Haryana, Rajasthan, Madhya Pradesh, Bihar,
Uttaranchal, etc. Presently, Devnagari is found to be the most scientific script. Since
every Indian script developed from the Brahmi script, Devnagari has a connection with
almost every other script. In Devnagari all letters are equal, i.e. there is no concept
of capital or small letters. Devnagari is half-syllabic in nature.
Optical Character Recognition (OCR):
OCR is the acronym for Optical Character Recognition. This technology allows a machine to
automatically recognize characters through an optical mechanism. Human beings recognize
many objects in this manner: the optical mechanism is the eyes, while the brain processes
the input, and the ability to interpret these signals varies from person to person
according to many factors. Reviewing these variables, the challenges faced by the
technologist developing an OCR system can be understood easily. Documents are in the form
of paper, which a human can read and understand, but it is not possible for the computer
to understand these documents directly.
In order to convert these documents into a computer-processable form, OCR systems are
developed. OCR is the process of converting scanned images of machine-printed or
handwritten text, numerals, letters and symbols into a computer-processable format such
as ASCII. OCR is an area of pattern recognition, and the processing of handwritten
characters is motivated largely by the desire to improve man-machine communication.
Proposed Algorithm:
The system performs character recognition by exploiting template matching for its
ability to recognize handwritten Hindi characters. The following steps are followed:
1. A database of handwritten Hindi characters is created in different handwritings from
different people.
2. Pre-processing of the training image:
a) Binarization of the image using bw = im2bw(Ibw, level).
b) Edge detection using iedge = edge(uint8(BW2)).
c) Dilation of the image using se = strel('square', 2); iedge2 = imdilate(iedge, se).
d) Region filling using ifill = imfill(iedge2, 'holes').
e) Character detection in the image using [Ilabel num] = bwlabel(Ifill);
Iprops = regionprops(Ilabel);
Ibox = [Iprops.BoundingBox];
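The MATLAB pre-processing steps above can be mimicked in a NumPy sketch. These are illustrative stand-ins for im2bw, imdilate and the bounding-box step, handling a single character only; a real implementation would use proper connected-component labelling (bwlabel/regionprops) for multiple characters.

```python
import numpy as np

def binarize(img, level=0.5):
    # Stand-in for im2bw: threshold a grayscale image with values in [0, 1].
    return (img < level).astype(np.uint8)   # 1 = ink (dark) pixel

def dilate(mask):
    # Stand-in for imdilate with a small square structuring element:
    # OR together 2x2 shifted copies of the mask.
    padded = np.pad(mask, 1)
    out = np.zeros_like(mask)
    for dy in (0, 1):
        for dx in (0, 1):
            out |= padded[1 + dy:1 + dy + mask.shape[0],
                          1 + dx:1 + dx + mask.shape[1]]
    return out

def bounding_box(mask):
    # Stand-in for regionprops(...).BoundingBox for one character:
    # first/last ink row and column.
    rows = np.any(mask, axis=1)
    cols = np.any(mask, axis=0)
    r0, r1 = np.where(rows)[0][[0, -1]]
    c0, c1 = np.where(cols)[0][[0, -1]]
    return r0, r1, c0, c1

img = np.ones((6, 6))
img[2:4, 2:5] = 0.1                     # one dark stroke on white paper
box = bounding_box(dilate(binarize(img)))
```

The dilation here grows the ink region by one pixel before the box is taken, as the MATLAB pipeline does between edge detection and region filling.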
Scanning :Handwritten character data samples are acquired on paper from various
people. These data samples are then scanned from paper through an optically digitizing
device such as optical scanner or camera. A flat-bed scanner is used at 300dpi which
converts the data on the paper being scanned into a bitmap image.
Fig. 2.
The scanned image must be a grayscale image or a binary image [4, 5], where a binary
image is a contrast-stretched grayscale image. The grayscale image then undergoes
digitization. In digitization [12], a rectangular matrix of 0s and 1s is formed from the
image, where 0 is black and 1 is white, and all RGB values are converted into 0s and 1s.
The matrix of dots represents a two-dimensional array of bits.
Digitization is also called binarization, as it converts the grayscale image into a
binary image using an adaptive threshold. Line and boundary detection is the process of
identifying points in a digital image at which the character's top, bottom, left and
right extents are calculated. A feed-forward neural network approach is used to combine
all the unique features, which are taken as inputs; one hidden layer is used to integrate
and collaborate similar features [9] and, if required, adjust the inputs by adding or
subtracting weight values; finally, one output layer is used to find the overall matching
score of the character.

CHARACTER RECOGNITION PROCEDURE
Pre-processing:- The pre-processing stage yields a clean document, in the sense that
maximal shape information, maximal compression and minimal noise on a normalized
image are obtained.
Segmentation: - Segmentation is an important stage because the extent one can reach in
separation of words, lines or characters directly affects the recognition rate of the script.
Feature extraction:- After segmenting the character, extraction of features such as
height, width, horizontal lines, vertical lines, and top and bottom detection is done.
Classification:- For classification or recognition back propagation algorithm is used.
Output:- Output is saved in text format.
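The feature extraction step above (height, width, horizontal/vertical lines, top and bottom detection) can be sketched for a binary character matrix. This is an illustrative NumPy sketch; the full-row/full-column line test is a simplification of whatever detector the report actually used.

```python
import numpy as np

def extract_features(char):
    """Geometric features for a binary character matrix (1 = ink):
    height, width, top/bottom, and horizontal/vertical line presence."""
    rows = np.any(char, axis=1)
    cols = np.any(char, axis=0)
    top, bottom = np.where(rows)[0][[0, -1]]
    left, right = np.where(cols)[0][[0, -1]]
    height = bottom - top + 1
    width = right - left + 1
    # A "horizontal line" = some row fully inked across the character's
    # width, e.g. the shirorekha (head line) of a Devnagari character.
    has_hline = bool(np.any(char[:, left:right + 1].sum(axis=1) == width))
    has_vline = bool(np.any(char[top:bottom + 1, :].sum(axis=0) == height))
    return {"height": int(height), "width": int(width),
            "top": int(top), "bottom": int(bottom),
            "h_line": has_hline, "v_line": has_vline}

char = np.zeros((5, 5), dtype=int)
char[0, 1:4] = 1          # head line
char[0:4, 1] = 1          # vertical stem
feats = extract_features(char)
```

Such a feature dictionary, flattened to numbers, is what would be handed to the back-propagation classifier of the next step.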
During training, the mean square error decreased gradually and became stable, and the
training and testing errors produced satisfactory results, as shown by the training
performance curve of the neural network. The accuracy of the trained network is tested
against output data and assessed in two ways. In the first, the predicted output values
are compared with the measured values; the results presented show the relative accuracy
of the predicted output, and the overall percentage error obtained from the tested
results is 4%. In the second, the root mean square error and the mean absolute error are
determined and compared. The performance index for training of the ANN is given in terms
of the mean square error (MSE); the tolerance limit for the MSE is set to 0.001. The MSE
of the training set becomes stable at 0.0070 when the number of iterations reaches 350.
The closeness of the training and the testing errors validates the accuracy of the model.

EXPERIMENTAL RESULTS
We created an interface for the proposed character recognition system using Microsoft
Visual C# 2008 Express Edition. The MLP network implemented is composed of three layers:
an input layer, a hidden layer and an output layer. The input layer consists of 180
neurons, which receive printed-image data from a 30x20 symbol pixel matrix. The hidden
layer consists of 256 neurons, a number decided on the basis of optimal results on a
trial-and-error basis [12]. The output layer is composed of 16 neurons. Number of
characters = 90, learning rate = 150, number of neurons in the hidden layer = 256.

TABLE I: PERCENTAGE OF ERROR FOR DIFFERENT EPOCHS
2. Existing Techniques
2.1 Modified Quadratic Discriminant Function (MQDF) Classifier
G. S. Lehal and Nivedan Bhatt [10] designed a recognition system for handwritten
Devnagari numerals using a Modified Quadratic Discriminant Function (MQDF) classifier.
A recognition rate of 89% and a confusion rate of 4.5% were obtained.
R. J. Ramteke et al. applied classifiers to 2000 numeral images obtained from
individuals of different professions. The results of PCA, correlation coefficients and
perturbed moments are an experimental success compared to MIs. This research
produced a 92.28% recognition rate using 77 feature dimensions.
PROPOSED APPROACH
3.1 Support Vector Machine (SVM)
SVM in its basic form implements two-class classification. It has been used in recent
years as an alternative to popular methods such as neural networks. The advantage of SVM
is that it takes into account both experimental data and structural behavior for better
generalization capability, based on the principle of structural risk minimization (SRM).
Its formulation approximates the SRM principle by maximizing the margin of class
separation, which is why it is also known as a large-margin classifier. The basic SVM
formulation is for linearly separable datasets.
It can be used for nonlinear datasets by indirectly mapping the nonlinear inputs into a
linear feature space where the maximum-margin decision function is approximated. The
mapping is done by using a kernel function. Multi-class classification can be performed
by modifying the two-class scheme. The objective of recognition is to interpret a
sequence of numerals taken from the test set. The architecture of the proposed system is
given in Fig. 3. The SVM (a binary classifier) is applied to the multi-class numeral
recognition problem by using the one-versus-rest method. The SVM is trained on the
training samples using a linear kernel.
The classifier performs its function in two phases: training and testing [29]. After
pre-processing and feature extraction, training is performed on the feature vectors,
which are stored in the form of matrices. The result of training is used for testing the
numerals. The training procedure is given in the algorithm listing.
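The training phase for a linear SVM can be sketched as follows. This is an illustrative stand-in (batch subgradient descent on the regularized hinge loss), not the report's actual trainer; the toy dataset and hyperparameters are arbitrary.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, iters=1000):
    """Minimize lam/2*||w||^2 + mean hinge loss by batch subgradient
    descent. Labels y must be in {-1, +1}. A one-versus-rest multi-class
    scheme would train one such classifier per class."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        viol = margins < 1                       # points inside the margin
        grad_w = lam * w - (y[viol] @ X[viol]) / n
        grad_b = -y[viol].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy linearly separable two-class data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = train_linear_svm(X, y)
preds = np.sign(X @ w + b)
```

For the one-versus-rest scheme in the text, this training loop would run once per digit class, with that class labeled +1 and all others -1.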
3.2 Statistical Learning Theory
Support Vector Machines were developed by Vapnik in the framework of Statistical
Learning Theory [13]. In statistical learning theory (SLT), the problem of classification
in supervised learning is formulated as follows: we are given a set of l training data
and their classes, {(x1, y1), ..., (xl, yl)} in R^n × R, sampled according to an unknown
joint probability distribution P(x, y) characterizing how the classes are spread in
R^n × R. To measure the performance of a classifier, a loss function L(y, f(x)) is
defined: L(y, f(x)) is zero if f classifies x correctly, and one otherwise. On average,
how f performs can be described by the risk functional R(f) = ∫ L(y, f(x)) dP(x, y),
approximated on the training set by the empirical risk
Remp(f) = (1/l) Σ_{i=1..l} L(yi, f(xi)). The ERM principle states that, given the
training set and a set of possible classifiers in the hypothesis space F, we should
choose the f in F that minimizes Remp(f). However, such an f does not necessarily
generalize well to unseen data, due to the overfitting phenomenon: Remp(f) is a poor,
over-optimistic approximation of R(f), the true risk. Neural network classifiers rely
on the ERM principle.
The normal practice to get a more realistic estimate of the generalization error, as in
neural networks, is to divide the available data into a training and a test set. The
training set is used to find a classifier with minimal empirical error (to optimize the
weights of an MLP neural network), while the test set is used to estimate the
generalization error (the error rate on the test set). If we have different classifier
hypothesis spaces F1, F2, ..., e.g. MLP neural networks with different topologies, we can
select a classifier from each hypothesis space (each topology) with minimal Remp(f) and
choose the final classifier with minimal generalization error. However, doing that
requires designing and training a potentially large number of individual classifiers.
Using SLT, we do not need to do that: the generalization error can be directly minimized
by minimizing an upper bound on the risk functional R(f). The bound
R(f) ≤ Remp(f) + ε(h, l, δ) holds for any distribution P(x, y) with probability at least
1 - δ, where the parameter h denotes the so-called VC (Vapnik-Chervonenkis) dimension
and ε is the confidence term defined by Vapnik [10] as
ε(h, l, δ) = sqrt((h(ln(2l/h) + 1) + ln(4/δ)) / l). ERM alone is not sufficient to find
a good classifier because, even with small Remp(f), when h is large compared to l the
confidence term ε will be large, so R(f) will also be large, i.e. not optimal. We
actually need to minimize Remp(f) and the confidence term at the same time, a process
which is called structural risk minimization (SRM). With SRM, we no longer need a test
set for model selection.
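The confidence term can be evaluated numerically to see the trade-off SRM manages. The form below is the common textbook statement of Vapnik's bound; the exact constants vary slightly between sources, so treat the numbers as indicative.

```python
import numpy as np

def vc_confidence(h, l, delta=0.05):
    """Confidence term epsilon(h, l, delta) in the VC bound
    R(f) <= Remp(f) + epsilon (common textbook form)."""
    return np.sqrt((h * (np.log(2 * l / h) + 1) + np.log(4 / delta)) / l)

# The bound loosens as VC dimension h grows relative to sample size l,
# and tightens as l grows: exactly the trade-off SRM manages.
small_h = vc_confidence(h=10, l=1000)
large_h = vc_confidence(h=500, l=1000)
more_data = vc_confidence(h=10, l=100000)
```

With these numbers, a rich hypothesis space (h = 500) on 1000 samples carries a much looser bound than a simple one (h = 10), while a hundredfold increase in data tightens the simple model's bound further.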
Taking different sets of classifiers F1, F2, ... with known h1, h2, ..., we can select f
from one of the sets with minimal Remp(f), compute the bound, and choose the classifier
with minimal R(f). No evaluation on a test set is needed, at least in theory. However, we
still have to train a potentially very large number of individual classifiers. To avoid
this, we want to make h tunable, i.e. to associate each potential classifier set Fi with
a VC dimension h and choose an optimal f from an optimal Fi in a single optimization
step. This is done in large-margin classification.
3.3 SVM formulations
SVM is realized from the above SLT framework. The simplest formulation of SVM is
linear, where the decision hyperplane lies in the space of the input data x. In this case
the hypothesis space is a subset of all hyperplanes of the form f(x) = w·x + b. SVM finds
the optimal hyperplane as the solution to the learning problem: geometrically, the
hyperplane furthest from both classes, since that will generalize best for future unseen
data.
There are two ways of finding the optimal decision hyperplane. The first is by finding a
plane that bisects the two closest points of the two convex hulls defined by the sets of
points of each class, as shown in figure 2. The second is by maximizing the margin
between two supporting planes, as shown in figure 3. Both methods produce the same
optimal decision plane and the same set of points that support the solution (the closest
points on the two convex hulls in figure 2, or the points on the two parallel supporting
planes in figure 3). These are called the support vectors.
4. Feature Extraction
4.1 Moment Invariants
The moment invariants (MIs) [1] are used to evaluate seven distributed parameters of a
numeral image. In any character recognition system, the characters are processed to
extract features that uniquely represent the properties of the character. Based on
normalized central moments, a set of seven moment invariants is derived. Further, the
resultant image was thinned and seven more moments were extracted. Thus we had 14
features (7 original and 7 thinned), which are applied as features for recognition using
a Gaussian distribution function. To increase the success rate, new features need to be
extracted by applying the affine invariant moment method.
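The first two of Hu's seven moment invariants, built from normalized central moments as described above, can be sketched as follows. This is an illustrative NumPy sketch; the remaining five invariants follow the same pattern from higher-order moments.

```python
import numpy as np

def hu_moments(img):
    """First two of Hu's seven moment invariants, computed from
    normalized central moments of a grayscale/binary image. Central
    moments make them translation-invariant; the normalization by
    m00 powers makes them scale-invariant."""
    ys, xs = np.mgrid[:img.shape[0], :img.shape[1]]
    m00 = img.sum()
    xc = (xs * img).sum() / m00
    yc = (ys * img).sum() / m00

    def mu(p, q):                       # central moment mu_pq
        return ((xs - xc) ** p * (ys - yc) ** q * img).sum()

    def eta(p, q):                      # normalized central moment
        return mu(p, q) / m00 ** (1 + (p + q) / 2)

    phi1 = eta(2, 0) + eta(0, 2)
    phi2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    return phi1, phi2

# Invariance check: the same 3x5 block at two positions.
a = np.zeros((12, 12)); a[2:5, 2:7] = 1
b = np.zeros((12, 12)); b[6:9, 4:9] = 1
```

Because the shape is identical up to translation, both images yield the same invariant values, which is exactly why such moments are useful as character features.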
4.2 Affine Moment Invariants
The affine moment invariants were derived by means of the theory of algebraic
invariants. A full derivation and comprehensive discussion of the properties of the
invariants can be found in the literature. Four more features can be computed for
character recognition; thus, overall, 18 features have been used for the Support Vector
Machine.
5. Experiment
5.1 Data Set Description
In this paper, the UCI Machine Learning data sets are used. The UCI Machine Learning
Repository is a collection of databases, domain theories, and data generators that are
used by the machine learning community for the empirical analysis of machine learning
algorithms. One of the available datasets is the Optical Recognition of Handwritten
Digits Data Set.
A dataset of handwritten Assamese characters was created by collecting samples from 45
writers. Each writer contributed 52 basic characters, 10 numerals and 121 Assamese
conjunct consonants, giving 183 entries per writer (52 characters + 10 numerals + 121
conjunct consonants) and 8235 samples in total (45 × 183). The handwriting samples were
collected on an iBall 8060U external digitizing tablet connected to a laptop, using its
cordless digital stylus pen. The dataset is distributed across 45 folders.
This file contains information about the character id (ID), character name (Label) and
the actual shape of the character (Char). In the raw Optdigits data, digits are
represented as 32x32 matrices. They are also available in a pre-processed form in which
the digits have been divided into non-overlapping 4x4 blocks and the number of on-pixels
counted in each block. This generates 8x8 input matrices in which each element is an
integer in the range 0..16.
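The 32x32-to-8x8 block-counting preprocessing described above can be reproduced with a single reshape-and-sum; the operation itself is fully specified in the text, only the toy bitmap below is invented.

```python
import numpy as np

def downsample_32_to_8(bitmap):
    """Divide a 32x32 binary digit bitmap into non-overlapping 4x4
    blocks and count the on-pixels in each block, giving an 8x8
    matrix of integers in 0..16 (the Optdigits preprocessing)."""
    return bitmap.reshape(8, 4, 8, 4).sum(axis=(1, 3))

digit = np.zeros((32, 32), dtype=int)
digit[:4, :4] = 1                      # one fully-on 4x4 block
grid = downsample_32_to_8(digit)
```

The reshape groups each row index i into (i // 4, i % 4) and likewise for columns, so summing over the two inner axes counts the on-pixels per 4x4 block.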
Preprocessing:
A series of operations is performed on the scanned image during preprocessing (figure 4).
The operations performed during preprocessing are:
(i) Median filtering is applied to reduce noise introduced into the character image
during scanning. The filter value is usually taken from a template centered on the point
of interest: to perform median filtering at a point, the values of the pixel and its
neighbors are sorted by gray level and their median is determined [12].
(ii) Global thresholding is applied to convert the image from gray scale to binary form.
(iii) The image is normalized to 7x7.
(iv) Thinning is performed by the method proposed in [10].
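Step (i), median filtering, can be sketched in NumPy as follows. This is an illustrative 3x3 stand-in (the text does not specify the template size), with edges handled by replication.

```python
import numpy as np

def median_filter3(img):
    """3x3 median filter: each output pixel is the median of its
    neighborhood, sorted by gray level, as described in step (i)."""
    padded = np.pad(img, 1, mode='edge')   # replicate edges
    H, W = img.shape
    # Stack the nine shifted views and take the median across them.
    stack = np.stack([padded[dy:dy + H, dx:dx + W]
                      for dy in range(3) for dx in range(3)])
    return np.median(stack, axis=0)

noisy = np.full((5, 5), 100.0)
noisy[2, 2] = 255.0                        # a single speck of impulse noise
clean = median_filter3(noisy)
```

The isolated speck is removed because 255 is an outlier among the nine sorted neighborhood values, which is exactly why median filtering suits scanning noise.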
PROPOSED WORK
Character recognition has been attempted through many different approaches, such as
template matching and statistical techniques like NN, HMM and the quadratic discriminant
function (QDF). Template matching works effectively for recognition of standard fonts,
but gives poor performance with handwritten characters and when the size of the dataset
grows; it is not an effective technique if there is font discrepancy. HMM models achieved
great success in the field of speech recognition in past decades; however, developing a
2-D HMM model for character recognition has been found difficult and complex. NN is found
to be computationally expensive for recognition purposes. N. Araki et al. applied
Bayesian filters, based on Bayes' theorem, to handwritten character recognition. Later,
discriminative classifiers such as the Artificial Neural Network (ANN) and Support Vector
Machine (SVM) attracted a lot of attention. G. Vamvakas et al. compared the performance
of three classifiers, Naive Bayes, K-NN and SVM, and attained the best performance with
SVM. However, SVM suffers from the limitation of kernel selection. ANNs can adapt to
changes in the data and learn the characteristics of the input signal; they also require
less storage and computation than SVMs. The most commonly used ANN classifiers are MLP
and RBFN.
B. K. Verma presented a system for HCR using MLP and RBFN networks in the task of
handwritten Hindi character recognition; the error back-propagation algorithm was used to
train the MLP networks. J. Sutha et al. showed the effectiveness of MLP for Tamil HCR
using Fourier descriptor features. R. Gheroie et al. proposed handwritten Farsi character
recognition using an MLP trained with the error back-propagation algorithm.
Similar-shaped characters are difficult to differentiate because of the very minor
variations in their structures. T. Wakabayashi et al. proposed an F-Ratio (Fisher Ratio)
based feature extraction method to improve results on similar-shaped characters. They
considered pairs of similar-shaped characters of different scripts like English,
Arabic/Persian, Devnagari, etc., and used QDF for recognition. QDF suffers from the
limitation of a minimum required dataset size. F. Yang et al. [14] proposed a method that
combines both structural and statistical features of characters for similar handwritten
Chinese character recognition.
Since various feature extraction methods and classifiers suited to their particular tasks have been used by researchers for character recognition, we propose a novel feature set that is expected to perform well for this application. In this paper, features are extracted on the basis of character geometry and are then fed to each of the selected ML algorithms for recognition of SSHHC.
METHODOLOGY FOR FEATURE EXTRACTION
A device is to be designed and trained to recognize the 26 letters of the alphabet.
We assume that some imaging system digitizes each letter centered in the system's field of vision, so that each letter is represented as a 5-by-7 grid of real values. The following figure shows perfect pictures of all 26 letters.
Figure 1: The 26 letters of the alphabet at a resolution of 5×7.
However, the imaging system is not perfect and the letters may suffer from noise:
Figure 2: A perfect picture of the letter A and four noisy versions (standard deviation of 0.2).
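A minimal sketch of how such noisy letter images could be generated (the 5×7 glyph for "A" below is an assumed example, not the actual dataset):

```python
import numpy as np

# A 5x7 pattern of the letter "A" (1 = ink); an assumed example glyph.
A = np.array([
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
], dtype=float)

def noisy(letter, std=0.2, seed=0):
    # Corrupt the perfect glyph with Gaussian noise of the given std.
    rng = np.random.default_rng(seed)
    return letter + rng.normal(0.0, std, letter.shape)

versions = [noisy(A, seed=s) for s in range(4)]
print(len(versions), versions[0].shape)
```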
Perfect classification of ideal input vectors is required and, more importantly, reasonably accurate classification of noisy vectors. Before OCR can be used, the source material must be scanned using an optical scanner (and sometimes a specialized circuit board in the PC) to read in the page as a bitmap (a pattern of dots). Software to recognize the images is also required. The character recognition software then processes these scans to differentiate between images and text and to determine what letters are represented in the light and dark areas. Older OCR systems match these images against stored bitmaps based on specific fonts. The hit-or-miss results of such pattern-recognition systems helped establish OCR's reputation for inaccuracy. Today's OCR engines add multiple neural-network algorithms.
SOLUTION APPROACH
On-line handwriting recognition involves the automatic conversion of text as it is written
on a special digitizer or PDA where a sensor picks up the pen-tip movements as well as
pen-up/pen-down switching. This kind of data is known as digital ink and can be regarded
as a digital representation of handwriting. The obtained signal is converted into letter
codes which are usable within computer and text-processing applications.
The elements of an on-line handwriting recognition interface typically include:
a pen or stylus for the user to write with.
a touch-sensitive surface, which may be integrated with, or adjacent to, an output display.
a software application which interprets the movements of the stylus across the writing surface, translating the strokes into digital text.
The recognition process itself generally consists of three stages: preprocessing, feature extraction and classification.
The purpose of preprocessing is to discard irrelevant information in the input data that can negatively affect recognition; this concerns both speed and accuracy. Preprocessing usually consists of binarization, normalization, sampling, smoothing and denoising.
step is feature extraction. Out of the two- or more-dimensional vector field received from
the preprocessing algorithms, higher-dimensional data is extracted. The purpose of this
step is to highlight important information for the recognition model. This data may include
information like pen pressure, velocity or changes of writing direction. The last big step is classification: various models are used to map the extracted features to different classes and thus identify the characters or words that the features represent.
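The three stages above can be sketched end to end; this is an illustrative toy pipeline, with made-up features and templates, not the recognizer described in this report:

```python
import numpy as np

def preprocess(strokes):
    # Normalize raw (x, y) pen samples to the unit square.
    pts = np.asarray(strokes, dtype=float)
    pts -= pts.min(axis=0)
    span = pts.max(axis=0)
    span[span == 0] = 1.0
    return pts / span

def extract_features(pts):
    # A toy feature vector: mean pen position plus mean writing direction.
    deltas = np.diff(pts, axis=0)
    direction = deltas.mean(axis=0) if len(deltas) else np.zeros(2)
    return np.concatenate([pts.mean(axis=0), direction])

def classify(features, templates):
    # Nearest-template classification over labelled feature templates.
    labels = list(templates)
    dists = [np.linalg.norm(features - templates[l]) for l in labels]
    return labels[int(np.argmin(dists))]

stroke = [(0, 0), (1, 2), (2, 4)]        # a rising diagonal stroke
f = extract_features(preprocess(stroke))
templates = {"rising": np.array([0.5, 0.5, 0.5, 0.5]),
             "flat":   np.array([0.5, 0.0, 0.5, 0.0])}
print(classify(f, templates))
```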
Hardware
Commercial products incorporating handwriting recognition as a replacement for keyboard input were introduced in the early 1980s. Examples include handwriting terminals such as the Pencept Penpad and the Inforite point-of-sale terminal. With the advent of the large consumer market for personal computers, several commercial products were introduced to replace the keyboard and mouse on a personal computer with a single pointing/handwriting system, such as those from PenCept, CIC and others. The first commercially available tablet-type portable computer was the GRiDPad from GRiD Systems, released in September 1989. Its operating system was based on MS-DOS.
In the early 1990s, hardware makers including NCR, IBM and EO released tablet
computers running the PenPoint operating system developed by GO Corp. PenPoint used
handwriting recognition and gestures throughout and provided the facilities to third-party
software. IBM's tablet computer was the first to use the ThinkPad name and used IBM's
handwriting recognition. This recognition system was later ported to Microsoft Windows
for Pen Computing, and IBM's Pen for OS/2. None of these were commercially
successful.
Advancements in electronics allowed the computing power necessary for handwriting
recognition to fit into a smaller form factor than tablet computers, and handwriting
recognition is often used as an input method for hand-held PDAs. The first PDA to
provide written input was the Apple Newton, which exposed the public to the advantage
of a streamlined user interface.
However, the device was not a commercial success, owing to the unreliability of the
software, which tried to learn a user's writing patterns. By the time of the release of
the Newton OS 2.0, wherein the handwriting recognition was greatly improved, including
unique features still not found in current recognition systems such as modeless error
correction, the largely negative first impression had already been made. After the discontinuation of the Apple Newton, the technology was ported to Mac OS X 10.2 and later in the form of Inkwell. Palm later launched a successful series of PDAs based on the Graffiti recognition system. Graffiti improved usability by defining a set of
"unistrokes", or one-stroke forms, for each character. This narrowed the possibility for
erroneous input, although memorization of the stroke patterns did increase the learning
curve for the user.
The Graffiti handwriting recognition was found to infringe on a patent held by Xerox, and
Palm replaced Graffiti with a licensed version of the CIC handwriting recognition which,
while also supporting unistroke forms, pre-dated the Xerox patent. The court finding of
infringement was reversed on appeal, and then reversed again on a later appeal. The
parties involved subsequently negotiated a settlement concerning this and other patents.
A Tablet PC is a special notebook computer that is outfitted with a digitizer tablet and a
stylus, and allows a user to handwrite text on the unit's screen. The operating system
recognizes the handwriting and converts it into typewritten text.
Windows Vista and Windows 7 include personalization features that learn a user's writing
patterns or vocabulary for English, Japanese, Chinese Traditional, Chinese Simplified and
Korean. The features include a "personalization wizard" that prompts for samples of a
user's handwriting and uses them to retrain the system for higher accuracy recognition.
This system is distinct from the less advanced handwriting recognition system employed
in its Windows Mobile OS for PDAs.
Although handwriting recognition is an input form that the public has become accustomed
to, it has not achieved widespread use in either desktop computers or laptops. It is still
generally accepted that keyboard input is both faster and more reliable. As of 2006, many
PDAs offer handwriting input, sometimes even accepting natural cursive handwriting, but
accuracy is still a problem, and some people still find even a simple on-screen keyboard
more efficient.
Software
Early software could understand print handwriting where the characters were separated. The first applied pattern recognition program was written in 1962 by Shelia Guberman, then in Moscow. Commercial examples came from companies such as
Communications Intelligence Corporation and IBM. In the early 1990s, two companies, ParaGraph International and Lexicus, came up with systems that could understand cursive handwriting. ParaGraph was based in Russia and founded by computer
scientist Stepan Pachikov while Lexicus was founded by Ronjon Nag and Chris Kortge
who were students at Stanford University. The ParaGraph CalliGrapher system was
deployed in the Apple Newton systems, and Lexicus Longhand system was made
available commercially for the PenPoint and Windows operating system.
Lexicus was acquired by Motorola in 1993 and went on to develop Chinese handwriting
recognition and predictive text systems for Motorola. ParaGraph was acquired in 1997 by
SGI and its handwriting recognition team formed a P&I division, later acquired from SGI
by Vadem. Microsoft has acquired CalliGrapher handwriting recognition and other digital
ink technologies developed by P&I from Vadem in 1999. Wolfram Mathematica (8.0 or
later) also provides a handwriting or text recognition function TextRecognize.
C4.5 is an extension of Ross Quinlan's earlier ID3 algorithm. It builds decision trees from a set of training data using the concepts of information gain and entropy. C4.5 uses a white-box model, so its results are easy to interpret, and it performs well even with large amounts of data.
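The information-gain computation at the heart of ID3/C4.5 can be sketched as follows (an illustrative implementation, not the tool used in this work):

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels, in bits.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    # Entropy reduction obtained by splitting `labels` into `groups`.
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - remainder

labels = ["x", "x", "y", "y"]              # two classes, 50/50
perfect = [["x", "x"], ["y", "y"]]         # a split that separates them fully
print(entropy(labels))                     # 1.0
print(information_gain(labels, perfect))   # 1.0
```

A split that perfectly separates the classes recovers the full entropy of the parent node as gain; C4.5 additionally normalizes this by the split's own entropy (gain ratio).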
Dataset 3 consists of samples of both the target and non-target classes, i.e. other similar shaped character pairs are also added to the dataset (making 500 samples in total), and the ML algorithms are trained with it. Non-target class characters are added to test the ability of the ML classifiers to pick out target characters among different characters. A few samples of the entire dataset are shown in Figure 1.
4.2. Performance Metrics
Performance of the classifiers is evaluated on the basis of the metrics described below:
i. Precision: the proportion of examples that truly have class x among all those classified as class x.
ii. Misclassification Rate: the number of instances classified incorrectly out of the total instances.
iii. Model Build Time: the time taken to train a classifier on a given data set.
Figure 1: Samples of Handwritten Hindi Characters
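The first two metrics can be computed directly from the true and predicted labels; the label vectors below are hypothetical examples:

```python
def precision(y_true, y_pred, cls):
    # Of everything predicted as `cls`, what fraction truly is `cls`?
    predicted = [t for t, p in zip(y_true, y_pred) if p == cls]
    return sum(t == cls for t in predicted) / len(predicted)

def misclassification_rate(y_true, y_pred):
    # Fraction of instances classified incorrectly.
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)

y_true = ["a", "a", "b", "b", "b"]
y_pred = ["a", "b", "b", "b", "a"]
print(precision(y_true, y_pred, "b"))          # 2/3
print(misclassification_rate(y_true, y_pred))  # 0.4
```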
4.3. Pre-processing
Following pre-processing steps are applied to the scanned character images:
i. First, each RGB character image is converted to grayscale and then binarized through thresholding.
ii. The image is inverted so that the background is black and the foreground is white.
iii. The shortest matrix that fits the entire character skeleton is then obtained for each image; this is termed the universe of discourse.
iv. Finally, spurious pixels are removed from the image, followed by skeletonization.
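Steps ii and iii (inversion and the universe of discourse) might be sketched like this, assuming the binary image is held as a NumPy array:

```python
import numpy as np

def invert(binary):
    # Make the background 0 (black) and the foreground 1 (white).
    return 1 - binary

def universe_of_discourse(binary):
    # Shortest matrix that fits the entire character: crop to the
    # bounding box of the foreground pixels.
    rows = np.any(binary, axis=1)
    cols = np.any(binary, axis=0)
    r0, r1 = np.where(rows)[0][[0, -1]]
    c0, c1 = np.where(cols)[0][[0, -1]]
    return binary[r0:r1 + 1, c0:c1 + 1]

img_raw = np.ones((8, 8), dtype=np.uint8)   # white page ...
img_raw[2:5, 3:6] = 0                       # ... with a dark 3x3 character
binary = invert(img_raw)                    # foreground is now 1
print(universe_of_discourse(binary).shape)  # (3, 3)
```

Spurious-pixel removal and skeletonization (step iv) are more involved and are omitted from this sketch.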
4.4. Feature Extraction
After pre-processing, features for each character image are extracted based on the
character
geometry using the technique described in [24]. The features are based on the basic line
types that form the skeleton of the character.
Each pixel in the image is traversed, and individual line segments, their directions and intersection points are identified from the isolated character image. For this, the image matrix is first divided into nine zones, and the number, length and type of the lines and intersections present in each zone are determined. A line can be horizontal, vertical, right-diagonal or left-diagonal. The following features are extracted for each zone, resulting in a feature vector of length 9 per zone:
i. Number of horizontal lines
ii. Number of vertical lines
iii. Number of Right diagonal lines
iv. Number of Left diagonal lines
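A simplified stand-in for this zoning scheme (not the exact method of [24]) could look like the following, detecting only full-width horizontal and full-height vertical runs per zone:

```python
import numpy as np

def zone_line_counts(skeleton):
    # Split the character skeleton into a 3x3 grid of zones and flag,
    # per zone, whether a full horizontal or vertical line is present.
    h, w = skeleton.shape
    features = []
    for i in range(3):
        for j in range(3):
            zone = skeleton[i * h // 3:(i + 1) * h // 3,
                            j * w // 3:(j + 1) * w // 3]
            horizontal = int(np.any(zone.sum(axis=1) == zone.shape[1]))
            vertical = int(np.any(zone.sum(axis=0) == zone.shape[0]))
            features.append((horizontal, vertical))
    return features

img = np.zeros((9, 9), dtype=np.uint8)
img[4, :] = 1                          # one full-width horizontal line
feats = zone_line_counts(img)
print(len(feats))                      # 9 zones
```

The full method additionally measures line lengths, diagonal lines and intersections, yielding the 9-element vector per zone described above.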
To perform the analysis, click the Run Analysis button; please be aware that it may take some time. After the analysis is complete, the other tabs in the sample application are populated with the analysis' information, and the contribution of each factor found during the discriminant analysis is plotted in a pie graph for easy visual inspection. Once the analysis is complete, we can test its classification ability on the testing data set.
The green rows have been correctly identified by the discriminant-space Euclidean distance classifier. We can see that it correctly identifies 98% of the testing data. The testing and training data sets are disjoint and independent.
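The discriminant-space Euclidean distance classifier amounts to nearest-mean classification; a minimal sketch, with hypothetical class means:

```python
import numpy as np

def nearest_mean_classify(x, class_means):
    # Assign x to the class whose (discriminant-space) mean is nearest
    # in Euclidean distance.
    labels = list(class_means)
    dists = [np.linalg.norm(x - class_means[l]) for l in labels]
    return labels[int(np.argmin(dists))]

# Hypothetical 2-D discriminant-space means for two digit classes.
means = {"0": np.array([0.0, 0.0]), "1": np.array([3.0, 3.0])}
print(nearest_mean_classify(np.array([2.5, 2.9]), means))  # "1"
```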
Results
After the analysis has been completed and validated, we can use it to classify the new
digits
drawn directly in the application. The bars on the right show the relative response for each
of the discriminant functions. Each class has a discriminant function that outputs a
closeness measure for the input point. The classification is based on which function
produces the maximum output.
Handwritten Devanagari character sets are taken from a test .bmp image. The following steps are carried out to obtain the best accuracy for a handwritten Hindi character image given to the system. First, the system is trained using different data sets or samples. The system is then tested on a few of the given samples and its accuracy is measured. For each character, features were computed and stored in templates for training the system.
Sets of handwritten Gurmukhi characters are made. The data set was partitioned into two parts: the first is used for training the system and the second for testing. For each character, features were computed and stored for training the network. Three network layers are used, i.e. one input layer, one hidden layer and one output layer. If the number of neurons in the hidden layer is increased, a problem of allocating the required memory occurs. Also, if the error tolerance is high, say 0.1, the desired results are not obtained, whereas with a smaller error tolerance, say 0.01, a high accuracy rate is obtained. The network also takes more cycles to learn when the error tolerance is small; with a high error tolerance the network learns in fewer cycles, but the learning is not as fine. The unit disk is taken for each character by finding the maximum radius of the character (i.e. the maximum distance between the center of the character and its boundary), so that the character fits on the disk.
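The maximum radius used to fit a character onto the unit disk can be computed as follows (a sketch assuming a binary character image):

```python
import numpy as np

def max_radius(binary):
    # Maximum distance from the character's centre to any foreground
    # pixel; used to scale the character onto a unit disk.
    ys, xs = np.nonzero(binary)
    centre = np.array([ys.mean(), xs.mean()])
    pts = np.stack([ys, xs], axis=1)
    return float(np.max(np.linalg.norm(pts - centre, axis=1)))

img = np.zeros((5, 5), dtype=np.uint8)
img[2, 0] = img[2, 4] = 1              # two pixels, 4 apart on one row
print(max_radius(img))                 # 2.0
```

Dividing all pixel coordinates by this radius then maps the character onto the unit disk.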
Here are some tables displaying the results obtained from the program. Images of the same letter are grouped together in every table. Each table gives information about the pre-processing operations that took place (i.e. noise, edge detection, filling of gaps) and also whether the image belongs to the same database as the training images. The amount of each filter is also recorded, so the maximum noise that the network can tolerate can be estimated. This of course varies from character image to character image. The results also vary each time the algorithm is executed; the variance is very small, but it is there.
Following are the main results of Gurmukhi character recognition (the character glyphs in the first column did not survive document conversion):

Character | No. of Samples | Train/Test | % Accuracy
    -     |      200       |   180/20   |    93%
    -     |      196       |   176/20   |    87%
    -     |      155       |   130/25   |    89%
    -     |      184       |   169/15   |    71%
    -     |      192       |   162/30   |    69%
    -     |      160       |   140/20   |    81%
    -     |      179       |   159/20   |    79%
    -     |      168       |   148/20   |    84%
    -     |      195       |   170/25   |    80%
    -     |      177       |   152/25   |    90%
    -     |      191       |   166/25   |    88%
    -     |      180       |   165/15   |    86%
    -     |      195       |   170/25   |    89%
    -     |      187       |   167/20   |    96%
    -     |      169       |   149/20   |    95%
    -     |      199       |   174/25   |    92%
    -     |      188       |   168/20   |    94%
    -     |      166       |   146/20   |    82%
    -     |      196       |   176/20   |    82%
    -     |      189       |   164/25   |    88%
    -     |      168       |   148/20   |    85%
    -     |      178       |   158/20   |    84%
    -     |      196       |   176/20   |    87%
    -     |      171       |   151/20   |    81%
    -     |      182       |   162/20   |    88%
    -     |      184       |   164/20   |    80%
    -     |      169       |   149/20   |    89%
    -     |      180       |   155/25   |    76%
    -     |      170       |   150/20   |    78%
    -     |      193       |   173/20   |    71%
    -     |      185       |   165/20   |    82%
    -     |      176       |   146/30   |    70%
    -     |      167       |   147/20   |    92%
    -     |      157       |   132/25   |    85%
    -     |      178       |   158/20   |    87%
    -     |      183       |   153/30   |    69%
    -     |      191       |   161/30   |    73%
    -     |      185       |   155/30   |    70%
Fig 7: We can see the analysis also performs rather well on completely new and
previously unseen data.
It is observed that the recognition rate using SVM is higher than with the Hidden Markov Model. However, free-parameter storage for the SVM model is significantly higher: the memory space required for the SVM is the number of support vectors multiplied by the number of feature values (in this case 350). This is significantly large compared to the HMM, which only needs to store the weights; the HMM needs less space due to its weight-sharing scheme. However, for the SVM, space can be saved by storing only the original online signals and the pen-up/pen-down status in a compact manner; during recognition, the model is then expanded dynamically as required. Table 3 shows the comparison of recognition rates between HMM and SVM using all three databases. SVM clearly outperforms HMM in all three isolated-character cases.
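The storage estimate above is a simple product; as a back-of-the-envelope check (the support-vector count below is hypothetical):

```python
def svm_storage(n_support_vectors, n_features=350):
    # Free parameters the SVM must store: one feature vector per
    # support vector, with 350 feature values each (as stated above).
    return n_support_vectors * n_features

# A hypothetical model with 2,000 support vectors:
print(svm_storage(2000))  # 700000 stored feature values
```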
The results for the isolated-character cases above indicate that the recognition rate of the hybrid word recognizer could be improved by using SVM instead of HMM. Thus, we are currently implementing a word recognizer using both HMM and SVM and comparing their performance.
Sometimes characters are overlapped and joined, and large numbers of character and stroke classes are present. Different users, or even the same user, can write differently at different times, depending on the pen or pencil, the width of the line, the slight rotation of the paper, the type of paper, and the mood and stress level of the person. A character can be written at different locations on the paper or in the window, and characters can be written in different fonts.
As the network is trained with more character sets, the accuracy of recognition increases, although at a slow rate. The results of the final training with 50 character sets and testing with the 10 character sets are presented. It can be concluded that as the network is trained with more sets, the accuracy of character recognition will definitely increase.
FUTURE SCOPE
Over the past three decades, many different methods have been explored by a large number of scientists to recognize characters. A variety of approaches have been proposed and tested by researchers in different parts of the world, including statistical methods, structural and syntactic methods, and neural networks. No OCR system in the world is 100% accurate to date. The recognition accuracy of the neural networks proposed here can be further improved: the number of character sets used for training is reasonably low, and the accuracy of the network can be increased by using more training character sets. This approach has been applied to the recognition of Gurmukhi characters only; in future work, it can be extended to the recognition of Gurmukhi words.
REFERENCES
[1] R. Plamondon and S. N. Srihari, "On-line and off-line handwriting recognition: a comprehensive survey," IEEE Transactions on PAMI, Vol. 22(1), pp. 63-84, 2000.
[2] Negi, C. Bhagvati and B. Krishna, "An OCR system for Telugu," Proceedings of the Sixth International Conference on Document Processing, pp. 1110-1114, 2001.
[3] J. I. Hong and J. A. Landay, "SATIN: A Toolkit for Informal Ink-based Applications," CHI Letters: ACM Symposium on UIST, 2(2), pp. 63-72.
[4] S. Mori, C. Y. Suen and K. Yamamoto, "Historical review of OCR research and development," Proceedings of the IEEE, Vol. 80(7), pp. 1029-1058, 1992.
[5] U. Pal and B. B. Chaudhuri, "Indian script character recognition," Pattern Recognition, Vol. 37(9), pp. 1887-1899, 2004.
[6] H. Bunke and P. S. P. Wang, Handbook of Character Recognition and Document Image Analysis, World Scientific Publishing Company, 1997.
[7] Stephen V. Rice, George Nagy and Thomas A. Nartker, Optical Character Recognition: An Illustrated Guide to the Frontier, Kluwer Academic Publishers, 1999.
[8] S. Mori, H. Nishida and H. Yamada, Optical Character Recognition, John Wiley & Sons, 1999.
[9] S. Impedovo, L. Ottaviano and S. Occhinegro, "Optical character recognition," International Journal of Pattern Recognition and Artificial Intelligence, Vol. 5(1-2), pp. 1-24, 1991.
[10] A. K. Jain, R. P. W. Duin and J. Mao, "Statistical pattern recognition: a review," IEEE Transactions on PAMI, Vol. 22(1), pp. 4-37, 2000.
[11] Manish Kumar, "Degraded text recognition of Gurmukhi script," DSpace, Thapar University, Patiala.