
Volume 5, Issue 6, June 2015

ISSN: 2277 128X

International Journal of Advanced Research in Computer Science and Software Engineering
Research Paper
Available online at: www.ijarcsse.com

Odia Offline Typewritten Character Recognition using Template Matching with Unicode Mapping
Bikram Ballav*
M.Tech, CSE, ITER, SoA University
Bhubaneswar, Odisha, India

Joydeep Sengupta
B.Tech, CSE, SoA University
Bhubaneswar, Odisha, India

Abstract: Optical character recognition (OCR) is a document image analysis method in which a scanned digital image
containing machine-printed or handwritten script is input to a system that translates it into an editable,
machine-readable digital text format. OCR has been a topic of interest for researchers around the globe over the past
decade, and the number of research papers on OCR is increasing day by day. Efficient algorithms have increased the
speed and accuracy of character recognition. A substantial amount of work has been done on foreign languages such as
English and Chinese, but very few papers address Indian languages, barring a few for Hindi and Bengali. Our research
was therefore directed towards developing a novel algorithm for offline typewritten Odia character recognition using
template matching.
Keywords: Odia Script; Character Recognition; Matching; Templates; Odia Unicode
I. INTRODUCTION
Optical character recognition is the process of translating images of handwritten or printed text into a format understood
by machines for the purposes of editing, indexing/searching, and reduced storage size [1]-[3]. The first step in OCR
is to go back to the roots of the language and study the individual characters that make it up. Each character is unique
in many ways; if we can extract the features that distinguish an individual character, we can train the computer on that
character. These features can then be compared with those of a test character, and in this way the computer can be made
to recognize it. Our study focuses on Odia character recognition. Odia is one of the oldest languages and is the official
language of the state of Odisha in India. The Odia language consists of 50 different characters, of which 12 are vowels
and the rest are consonants. Character recognition in this language is particularly difficult because many characters
look alike and combined characters are very difficult to segregate. To make recognition easier, we have developed an
algorithm. Recognizing the characters and numerals of a language is a challenging problem because of variations due to
different font sizes and the different kinds of variation introduced during writing. Character recognition (CR) can be
broadly classified into two groups:
On-line character recognition systems
Off-line character recognition systems
Online Character Recognition systems:
Online character recognition is the real-time recognition of characters. Online systems have better timing information
for recognizing characters because they capture the writing directly, together with the order of the strokes. Online
character recognition therefore also avoids the initial step of locating the characters ([4], [5], [6]).
Offline Character Recognition:
Offline character recognition involves the automatic conversion of text in an image into letter codes. In this type of
character recognition, typewritten characters are usually scanned into images, converted into grayscale or binary
images, and then fed to the recognition algorithm. Offline character recognition is a more challenging task than online
recognition because we have no control over the medium and devices ([4], [5], [6]).
Traditionally, OCR techniques are classified into:
Template based approach
Feature based approach
In the template-based approach, an unknown pattern is superimposed on the ideal template pattern, and the degree of
correlation between the two is used to decide the classification. Early OCR systems employed only the template
approach, but it becomes ineffective in the presence of noise, changes in handwriting, etc. Template matching is a
trainable process, as the template characters can be changed. Modern systems therefore combine it with feature-based
approaches to obtain better results. The feature-based approach derives properties (features) from the test patterns and
employs them in a more sophisticated classification model; the major steps are described in Section IV.
2015, IJARCSSE All Rights Reserved

Page | 823

Ballav et al., International Journal of Advanced Research in Computer Science and Software Engineering 5(6),
June- 2015, pp. 823-829
Objective
The objective of the project is to develop a technique that can efficiently recognize typewritten characters of the Odia
language. Our main emphasis is on template matching, where the input character is directly matched with a set of
prototype characters representing each possible class. We further concentrate on Unicode mapping; Unicode defines a
uniform way of encoding multilingual text. The ultimate goal of character recognition is to simulate human reading
capabilities.
Organization
The rest of the paper is organized as follows. Section II describes the motivation of the project. Section III explains
the Odia language and data collection. The major steps in character recognition are discussed in Section IV.
Implementation details and the proposed framework are presented in Section V. The experimental results are discussed in
Section VI, and the conclusions and future work are given in Section VII.
II. MOTIVATION
A large amount of work has been done in the field of OCR, but very little research has been done for the Odia language.
Odisha has a rich heritage of manuscripts and novels that need to be preserved in the Odia language and script, and
they are in the process of being lost due to the lack of Odia OCR systems. The basic need in text recognition is the
automatic recognition of alphabetic characters and numerals by computers. OCR systems have been developed for some
foreign languages, and attempts have been made for some Indian scripts such as Devanagari and Tamil. We are therefore
making an attempt to develop an OCR system for typewritten Odia.
III. ODIA LANGUAGE AND DATA COLLECTION
India is a multilingual and multi-script country, and Odia is one of its popular languages, used mainly in the state of
Odisha. The Odia script, in which the Odia language is written, developed from the Kalinga script, one of the many
descendants of the Brahmi script of ancient India. As in other Indian scripts, the concept of upper and lower case is
absent in Odia. The alphabet of the modern Odia script consists of 12 vowels and 41 consonants. Of the 12 independent
vowels, 11 have dependent forms (all except the first vowel). These characters are called basic characters; they and
the Odia numerals, with their corresponding English numerals, are shown in Fig. 1.1, Fig. 1.2 and Fig. 1.3. Odia script
is written from left to right. A vowel following a consonant takes a modified shape. Depending on the vowel, its
modified shape is placed at the left, at the right, at both left and right, or at the bottom of the consonant. These
modified shapes are called modifiers or matra, as shown in Fig. 1.4. A consonant or a vowel following a consonant
sometimes takes a compound orthographic shape, which we call a compound character. Compound characters can be
combinations of two consonants as well as of a consonant and a vowel. There are more than 200 compound characters in
Odia script [7]. In this paper we consider the recognition of offline typewritten Odia basic characters by using
template matching with Unicode. Some similarly shaped characters, shown in Fig. 1.5, cause difficulties and make it
harder for the recognition system to achieve a high recognition rate. In character recognition, the character symbols
of a language are transformed into symbolic representations such as ASCII or Unicode. The basic problem is to assign
each digitized character to its symbolic class. The Unicode code points of the Odia characters are shown in Fig. 7.

Figure 1.1 Odia Vowels

Figure 1.2 Odia Consonants

Figure 1.3 Odia Numerals



Figure 1.4 Odia Modified Characters

Figure 1.5 Similar Shaped Characters


IV. MAJOR STEPS IN ODIA CHARACTER RECOGNITION
The process of Odia character recognition consists of a series of stages, with each stage passing its results on to the
next in pipeline fashion, as shown in Fig. 2. There is no feedback loop that would permit an earlier stage to make use of
knowledge gained at a later point in the process.
An optical character recognition system loads a character (text) image, preprocesses it, extracts suitable image
features (in the form of a matrix), and classifies the characters based on those features. The known features are stored
in an image model library, and the system recognizes the image according to the degree of similarity between the loaded
image and the image models. To recognize characters, images containing Odia text are first acquired as input.
The images are then stored in a picture file format such as BMP, JPG or GIF. Each image subsequently passes through the
preprocessing, segmentation, feature extraction and classification steps. Unicode mapping is the technique that forms
the major part of our project.
Preprocessing operations include image processing, binarization, noise reduction, and skew detection and correction of
the digital image, so that the subsequent algorithms on the way to final classification can be simpler and more accurate.

Figure 2. Block Diagram of OCR


Segmentation includes line segmentation, which extracts lines from a paragraph, and character segmentation, which
extracts characters from a line. After preprocessing and segmentation, some features are extracted from the character
image. The techniques for extracting such features are often divided into three main groups, where the features are
found from:
The distribution of points.
Transformations and series expansions.
Structural analysis.
Template matching is the process of finding the location of a sub-image, called a template, inside an image. It differs
from the other techniques in that no features are actually extracted. Instead, the matrix containing the image of the
input character is directly matched with a set of prototype characters representing each possible class. The distance
between the pattern and each prototype is computed, and the class of the prototype giving the best match is assigned to
the pattern. The technique is simple, easy to implement in hardware, and has been used in many commercial OCR
machines. However, it is sensitive to noise and style variations and has no way of handling rotated characters. Also,
only a small number of possible postures can be recognized. If the application requires a large posture set, template
matching will not work well.
V. METHODOLOGY AND PROPOSED ALGORITHM
The steps of the proposed algorithm for Odia Optical Character Recognition (OOCR) were implemented in MATLAB R2010a
(64-bit), following the block diagram shown in Fig. 2.
Database Creation:
Initially, we created a database of all character images of the Odia script, from - , of 50×50 pixels each.
Data Acquisition:
Through the scanning process, a digital image of the original document is captured. Scanned images are then stored in
a picture file format such as BMP, JPG or GIF, as shown in Fig. 3.

Figure 3. Two Input Images: (a), (b)


RGB to gray conversion:
The first stage of preprocessing converts the input RGB image into a grayscale image, as shown in Fig. 4.
Binarization:
Binarization is the process of converting a grayscale image (pixel values 0 to 255) into a binary image (pixel values 0
and 1) by selecting a threshold value between 0 and 255 (here the threshold is 128), as shown in Fig. 4.
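
The two preprocessing steps above can be sketched in a few lines. This is an illustrative Python sketch, not the MATLAB implementation used in this work. The luma weights (ITU-R BT.601) and the polarity of the binary image (1 for dark ink, 0 for background) are assumptions; only the threshold of 128 comes from the text.

```python
def rgb_to_gray(rgb_image):
    """Convert rows of (R, G, B) tuples to rows of 0-255 gray levels."""
    # ITU-R BT.601 luma weights: an assumed choice, the paper does not name one.
    return [[int(round(0.299 * r + 0.587 * g + 0.114 * b)) for (r, g, b) in row]
            for row in rgb_image]


def binarize(gray_image, threshold=128):
    """Map gray levels below the threshold to 1 (ink) and the rest to 0.

    The ink = 1 polarity is an assumption made for this sketch.
    """
    return [[1 if pixel < threshold else 0 for pixel in row]
            for row in gray_image]


# Tiny 2x3 example image: dark pixels become ink, light pixels background.
rgb = [[(0, 0, 0), (255, 255, 255), (200, 200, 200)],
       [(30, 30, 30), (250, 240, 245), (10, 20, 15)]]
gray = rgb_to_gray(rgb)
binary = binarize(gray)
print(binary)
```

A fixed global threshold is the simplest choice and suits clean typewritten scans; unevenly lit scans would likely need an adaptive threshold instead.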

Figure 4. Document image binarization: (a) binary image, (b) gray image


Skew Detection & Correction:
While scanning, if the paper or source document is not aligned properly, the components in the image may be tilted.
This can lead to erroneous behaviour of the OCR system. To prevent this, a skew detection and correction method is
applied, which detects and removes the skew from the image; the image boundaries are then adjusted so that the result
looks like the original image.
Segmentation:
Segmentation is an operation that seeks to decompose an image of a sequence of characters into sub-images of individual
symbols. Character segmentation is a key requirement that determines the utility of conventional character recognition
systems. It includes line, word and character segmentation. The methods used can be classified by the type of text and
the strategy being followed, such as recognition-based segmentation.
A. Line segmentation:
In a printed script, the text lines are of almost the same height, provided the script is written in a single font size.
Here the script is produced by a typewriter, so the font size is uniform throughout. Between two text lines there is a
narrow horizontal band with either no text pixels or very few. By detecting and storing the break-points at these
valleys, the text line bands can be retrieved.
B. Character segmentation:
After line segmentation, each segmented line goes through character segmentation, in which the line is split into its
individual (isolated) characters for further processing.
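
The line and character segmentation steps can be illustrated with projection profiles: row sums locate the empty bands between text lines, and column sums within a line locate the gaps between characters. The Python sketch below rests on assumptions. It takes a binary image with ink pixels equal to 1, and it treats only completely empty rows or columns as gaps, whereas the text also allows bands with very few pixels. The function names are hypothetical.

```python
def find_runs(profile):
    """Return (start, end) index pairs of consecutive non-zero profile entries."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i                      # a run of ink begins
        elif value == 0 and start is not None:
            runs.append((start, i))        # an empty band ends the run
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_lines(binary):
    """Split a binary page (ink = 1) into text-line bands using row sums."""
    row_profile = [sum(row) for row in binary]
    return [binary[top:bottom] for top, bottom in find_runs(row_profile)]


def segment_characters(line):
    """Split one text-line band into character sub-images using column sums."""
    col_profile = [sum(col) for col in zip(*line)]
    return [[row[left:right] for row in line]
            for left, right in find_runs(col_profile)]


# Toy page: two "lines" separated by one empty row.
page = [[1, 1, 0, 1],
        [1, 0, 0, 1],
        [0, 0, 0, 0],
        [0, 1, 1, 0]]
lines = segment_lines(page)
chars = [segment_characters(line) for line in lines]
print(len(lines), [len(c) for c in chars])
```

Splitting purely on empty columns separates isolated basic characters; touching glyphs or detached modifiers would need extra handling that this sketch does not attempt.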
Feature Extraction:
After character segmentation, each segmented character is extracted in the form of a matrix, as shown in Fig. 5.
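
Before a character matrix can be compared with stored templates, it generally has to be brought to the same size as the templates. The Python sketch below shows one way to do this: crop the character to its ink bounding box, then resize it with nearest-neighbour sampling. Both steps and the function names are assumptions made for illustration, since the paper does not describe how its matrices are normalized.

```python
def crop_to_ink(char):
    """Crop a binary character matrix (ink = 1) to its ink bounding box."""
    rows = [i for i, row in enumerate(char) if any(row)]
    cols = [j for j in range(len(char[0])) if any(row[j] for row in char)]
    if not rows:
        return char                        # blank image: nothing to crop
    return [row[cols[0]:cols[-1] + 1] for row in char[rows[0]:rows[-1] + 1]]


def resize_nearest(char, height, width):
    """Resize a matrix to height x width by nearest-neighbour sampling."""
    src_h, src_w = len(char), len(char[0])
    return [[char[r * src_h // height][c * src_w // width]
             for c in range(width)]
            for r in range(height)]


# A 4x4 character with a one-pixel empty border, normalized to 4x4.
char = [[0, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
cropped = crop_to_ink(char)
normalized = resize_nearest(cropped, 4, 4)
print(normalized)
```

The target size is a parameter here; in practice it would be set to the size of the template images in the database.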

Figure 5. Character Extraction in form of Matrix


Classification using Template-matching:
This technique differs from the others because no features are actually extracted. Instead, the matrix containing the
image of the input character is directly matched with a set of prototype characters representing each possible class.
The distance between the pattern and each prototype is computed, and the class of the prototype giving the best match
is assigned to the pattern. The technique is simple, easy to implement in hardware, and has been used in many
commercial OCR machines. There are two steps in building a classifier: training and testing. These steps can be broken
down further into sub-steps, as shown in Fig. 6.
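
The matching rule described above (compute the distance to every prototype and take the closest one) can be sketched as follows. This Python sketch assumes the Hamming distance, that is, the number of differing pixels between two equally sized binary matrices; the paper does not state which distance or correlation measure it uses. The template labels are hypothetical toy classes, not Odia characters.

```python
def hamming_distance(a, b):
    """Count the pixels at which two equally sized binary matrices differ."""
    return sum(pa != pb
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))


def classify(pattern, templates):
    """Return (label, distance) of the template closest to the pattern.

    templates maps a class label to a prototype matrix of the same size
    as the pattern.
    """
    best = min(templates,
               key=lambda label: hamming_distance(pattern, templates[label]))
    return best, hamming_distance(pattern, templates[best])


# Two toy 3x3 prototypes with hypothetical labels.
templates = {
    "vertical_bar":   [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "horizontal_bar": [[0, 0, 0], [1, 1, 1], [0, 0, 0]],
}
# A vertical bar with one noisy pixel in the bottom-right corner.
pattern = [[0, 1, 0], [0, 1, 0], [0, 1, 1]]
label, distance = classify(pattern, templates)
print(label, distance)
```

In this scheme, training amounts to storing one or more prototypes per class, and testing amounts to calling the classifier on each segmented character. A rejection option could be added by refusing to assign a class when the best distance exceeds a chosen limit.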

Figure 6. Template Mapping Approach


Unicode mapping:
The Unicode Standard is the universal character encoding scheme. It defines a uniform way of encoding multilingual
text, enables the international exchange of text data, and forms the foundation of global software. The Odia Unicode
range is U+0B00 to U+0B7F.
For example, the Unicode code point for the character ଅ is U+0B05, and the code point for the character ଙ is U+0B19.
Every code point in the Odia block lies in the Basic Multilingual Plane and therefore fits in a single 16-bit code
unit (2 bytes in UTF-16). Unicode text is simple to parse and process, and Unicode characters have well-defined
semantics. Hence Unicode is chosen as the encoding scheme for the current work.
After classification, the characters are recognized and a mapping table is created in which each recognized character
is mapped to its Unicode code point. The table in Fig. 7 shows the Unicode code points with the corresponding Odia
characters.
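
The mapping table can be pictured as a lookup from the class label produced by the classifier to a code point in the Odia block. The Python sketch below includes only the two code points quoted in the text, U+0B05 and U+0B19. The label names and the table layout are assumptions made for illustration, not the table of Fig. 7.

```python
ODIA_BLOCK_START, ODIA_BLOCK_END = 0x0B00, 0x0B7F

# Hypothetical class labels mapped to the two code points quoted in the text.
CLASS_TO_CODE_POINT = {
    "A": 0x0B05,     # ORIYA LETTER A
    "NGA": 0x0B19,   # ORIYA LETTER NGA
}


def to_unicode_text(labels):
    """Convert a sequence of recognized class labels into a Unicode string."""
    chars = []
    for label in labels:
        code_point = CLASS_TO_CODE_POINT[label]
        if not ODIA_BLOCK_START <= code_point <= ODIA_BLOCK_END:
            raise ValueError(f"{label!r} maps outside the Odia block")
        chars.append(chr(code_point))
    return "".join(chars)


text = to_unicode_text(["A", "NGA"])
print(text, [f"U+{ord(ch):04X}" for ch in text])
```

Once the recognized characters are in a Unicode string, they can be written to a text file in any Unicode encoding, for example UTF-8, and edited in any Unicode-aware editor.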

Figure 7. Unicode Mapping


VI. RESULTS
No standardized test sets exist for character recognition, and because the performance of an OCR system depends heavily
on the quality of the input, it is difficult to evaluate and compare different systems. Still, recognition rates are
often given, usually as the percentage of characters correctly classified. According to the results in Fig. 8.3 and
Fig. 8.4, only the 2nd vowel is not matched properly.
$$\%\,\text{Accuracy} = \frac{\text{No. of characters found correctly}}{\text{Total no. of patterns}} \times 100$$
$$\%\,\text{Accuracy} = \frac{46}{47} \times 100 = 97.87\%$$
To illustrate the accuracy on Odia characters, typewritten text images in different fonts and sizes were tested with
the OCR algorithm in MATLAB R2010a (64-bit), and performance was then measured using the sample shown in Fig. 8.1 and
Fig. 8.2.

Figure 8.1 Input Image with Noise

Figure 8.2 Output with Text Format


The evaluation of OCR system follows three different performance rates:
Recognition rate:
The proportion of correctly classified characters.
Rejection rate:
The proportion of characters which the system was unable to recognize. Rejected characters can be flagged by the
OCR system and are therefore easily retraceable for manual correction.
Error rate:
The proportion of characters erroneously classified. Misclassified characters go undetected by the system, and manual
inspection of the recognized text is necessary to detect and correct these errors.
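
The three rates are simple proportions of the same total, which the following Python sketch makes explicit. It uses the 46 correctly recognized characters out of 47 reported in this section. Treating the one remaining character as a misclassification rather than a rejection is an assumption made only for this illustration.

```python
def performance_rates(correct, rejected, misclassified):
    """Return recognition, rejection and error rates as percentages."""
    total = correct + rejected + misclassified
    if total == 0:
        raise ValueError("no characters were evaluated")
    return {
        "recognition": 100.0 * correct / total,
        "rejection": 100.0 * rejected / total,
        "error": 100.0 * misclassified / total,
    }


# 46 of 47 characters correct; the split of the remaining one is assumed.
rates = performance_rates(correct=46, rejected=0, misclassified=1)
print({name: round(value, 2) for name, value in rates.items()})
```

The three rates always sum to 100%, so a lower error rate can be bought with a higher rejection rate when uncertain characters are flagged for manual correction instead of being classified.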

Figure 8.3 Input image with all vowels and consonants



Figure 8.4 Output in text format of all vowels and consonants


VII. CONCLUSIONS AND FUTURE WORK
The typewritten Odia character recognition algorithm was successfully tested using a large number of test samples of
images in different fonts. Accuracy was about 97% when basic characters and numerals are considered. Around 3% of the
characters were misrecognized due to similarity between characters. Our work focused on the template matching
technique with Unicode mapping, which extracts each individual character, matches it against the templates and maps it
to its Unicode code point. As a result, the recognition process of this system runs smoothly. Although the prototype
offers several advantages to users, it still faces a number of limitations with handwritten characters and compound
characters. Character recognition remains a challenging problem because the same character varies with font size,
with different types of noise and with the person writing. Further research could therefore improve the prototype by
extending the template matching technique to handwritten characters and compound characters.
REFERENCES
[1] D.A. Jadhav and G. K. Veeresh, Multi-Font/Size Character Recognition, International Journal of Advances in
Engineering & Technology, May 2012.
[2] Optical character recognition, http://en.wikipedia.org/wiki/Optical_character_recognition.
[3] Ritesh Kapoor, Sonia Gupta, and C.M. Sharma, Multi-font/size character recognition and document scanning,
Int. J. of computer application, vol. 23, no. 1, pp. 21-24, June 2011.
[4] Priya Sharma and Randhir Singh, Performance of English Character Recognition with and without Noise,
International Journal of Computer Trends and Technology, volume 4, Issue 3, 2013.
[5] http://www.cs.uic.edu/~srizvi/BIT_Thesis.pdf
[6] Special Issue on Character Recognition and Document Understanding, IEICE Trans. Information and Systems,
vol. E79-D, no. 5, July 1996.
[7] B. B. Chaudhuri, U. Pal and M. Mitra, Automatic recognition of printed Oriya script, IEEE Transactions on
Pattern Recognition and Machine Intelligence, Vol. 27, part 1, pp. 23-34, February 2002 (Section III).
[8] Chen, M. Y., Kundu, A. and Zhou, J., Off-line Handwritten Word Recognition using a HMM Type Stochastic
Network, IEEE Transactions on Pattern Recognition and Machine Intelligence, Vol. 16, pp. 481-486, 1994.
[9] Jagruti Chandarana and Mayank Kapadia, Optical Character Recognition, IJETAE Transaction, Volume 4, Issue 5,
May 2014.
[10] The Unicode Standard (U0B00.pdf), http://www.unicode.org/Public/7.0.0/charts/ for a complete archived file of
character code charts for Unicode 7.0.
