
Project Report On

OPTICAL CHARACTER RECOGNITION SYSTEM FOR PRINTED TEXT


Submitted to
Rashtrasant Tukadoji Maharaj Nagpur University, Nagpur for partial fulfillment of the degree in

Bachelor of Engineering
(Information Technology) Seventh Semester
by Aabha Shivhare, Bhagyashree Darokar, Lavina Agrawal and Mayuri Kayande

Under the Guidance of Prof. C. U. Upadhyay

Department of Information Technology Shri Ramdeobaba College of Engineering & Management Nagpur-13 2011-2012

CERTIFICATE
This is to certify that the Project Report on

OPTICAL CHARACTER RECOGNITION SYSTEM FOR PRINTED TEXT


is a bona fide work submitted to Rashtrasant Tukadoji Maharaj Nagpur University, Nagpur by Aabha Shivhare, Bhagyashree Darokar, Lavina Agrawal and Mayuri Kayande for partial fulfillment of the degree of Bachelor of Engineering in Information Technology
Seventh Semester

during the academic year 2011-2012 under the guidance of

Prof. C. U. Upadhyay
Assistant Professor, RCOEM, Nagpur

Dr. D. S. Adane
Head, Department of Information Technology RCOEM, Nagpur

Dr. V. S. Deshpande
Principal RCOEM, Nagpur

Department of Information Technology Shri Ramdeobaba College of Engineering & Management Nagpur-13
2011-2012

ACKNOWLEDGEMENT
We would like to sincerely thank our project guide, Prof. C. U. Upadhyay, who helped us proceed in the right direction and solve the problems we faced during our project work.

We would also like to extend our gratitude to our honorable Principal, our honorable HOD, the whole staff of the Information Technology Department and the staff of the college library. Lastly, we would like to thank all those who have contributed directly or indirectly to making this project a success.

PROJECTEES
Aabha Shivhare, Bhagyashree Darokar, Lavina Agrawal, Mayuri Kayande

CONTENTS
Page No.

                                                               Page No.

ABSTRACT                                                              i
LIST OF FIGURES                                                     iii
LIST OF TABLES                                                       iv

CHAPTER 1  INTRODUCTION                                               1
  1.1  TYPES OF CHARACTER RECOGNITION SYSTEM                          2
       1.1.1  ONLINE CHARACTER RECOGNITION                            2
       1.1.2  OFFLINE CHARACTER RECOGNITION                           3
  1.2  DIFFERENT PHASES OF CHARACTER RECOGNITION                      4
  1.3  PRE-PROCESSING                                                 4
       1.3.1  BINARIZATION                                            5
       1.3.2  SEGMENTATION                                            5
       1.3.3  FEATURE EXTRACTION AND CLASSIFICATION                   6
  1.4  BACKGROUND                                                     6

CHAPTER 2  METHODOLOGY                                               10
  2.1  PREPROCESSING                                                 10
       2.1.1  BINARIZATION                                           11
       2.1.2  BOUNDING OF CHARACTERS                                 11
  2.2  SEGMENTATION                                                  11
       2.2.1  ALGORITHM                                              12
  2.3  FEATURE EXTRACTION                                            13
       2.3.1  APPROACH 1: COORDINATE ANALYSIS                        13
       2.3.2  APPROACH 2: CURVE FITTING                              14

CHAPTER 3  IMPLEMENTATION                                            23
  3.1  RESULT                                                        23
  3.2  CONCLUSION                                                    24
  3.3  SNAPSHOTS                                                     25

CHAPTER 4  APPLICATIONS OF CHARACTER RECOGNITION SYSTEM              28

Bibliography                                                         30

ABSTRACT
In this computer era, information is processed by computers, and for that purpose it must be available in a computer-recognizable form. Even with the introduction of new technologies, typewritten and printed matter has persisted as a means of communicating and recording information in day-to-day life, and to make it available as digital data it must be entered into the computer manually. There is therefore a need for a platform that converts printed documents into digital data. Hence, we intend to convert printed data, in the form of digital images, into electronic form. The focus here is on the recognition of offline printed characters, which can be used in common applications such as bank cheques, commercial forms, government records, bill processing systems, postcode recognition, signature verification, passport readers and the offline document recognition generated by our expanding technological society.

Optical character recognition (OCR) is the mechanical or electronic translation of scanned images of handwritten, typewritten or printed text into machine-encoded text. It is widely used to convert books and documents into electronic files, to computerize a recordkeeping system in an office, or to publish text on a website. OCR makes it possible to edit the text, search for a word or phrase, store it more compactly, display or print a copy free of scanning artifacts, and apply techniques such as machine translation, text-to-speech and text mining to it. OCR is a field of research in pattern recognition, artificial intelligence and computer vision.

If the software does not produce correct output, there should be a way to retrain its database. The input image is to be in the bitmap file format. For a scanned image, a high-quality scanner as well as good paper quality is required.

The aim of this project is to develop OCR software for offline recognition. OCR is a field of research in pattern recognition, artificial intelligence and machine vision. Input is taken by scanning printed text, and the recognized characters are stored in a text file. Thus we can make the computer read a printed document.

DELIVERABLES:

INPUT: The user uploads a scanned copy of a document containing printed English characters, which can be easily obtained. The image can be in any format compatible with Matlab.

OUTPUT: The printed document is converted into .txt form. The output text document contains all the text data that was present in the input image. This .txt file can be saved by the user for further processing.


LIST OF FIGURES

Serial No.    Description                                                    Page No.
Figure 1.1    Types of character recognition system                                 2
Figure 1.2    Block diagram of printed text recognition                             4
Figure 2.1    Bounding of characters                                               11
Figure 2.2    The image file of letter S                                           12
Figure 2.3    Left scan of figure 2.2                                              12
Figure 2.4    Top scan of figure 2.2                                               12
Figure 2.5    Right scan of figure 2.2                                             12
Figure 2.6    Down scan of figure 2.2                                              12
Figure 2.7    Red points are considered during analysis instead of all
              the points obtained in each scan                                     13
Figure 2.8    Segmentation of character A, left scan                               17
Figure 2.9    Curve fit tool in Matlab                                             17
Figure 2.10   The coordinates do not fit when taken as (x, y)                      19
Figure 2.11   The coordinates do fit when taken as (y, x)                          19
Figure 2.12   Fit obtained for coordinates of the top scan                         20
Figure 2.13   Curve fitting for top scan in 2nd degree                             21
Figure 3.1    Typewritten document of English capital letters                      25
Figure 3.2    The image is selected                                                25
Figure 3.3    Line segmentation                                                    26
Figure 3.4    Character segmentation                                               26
Figure 3.5    The recognized characters in the text file                           27

LIST OF TABLES

Serial No.   Description                                                      Page No.
Table 1.1    Comparison between online and offline handwritten characters           3
Table 2.1    R Square values obtained for left scan of letter S                    18
Table 2.2    R Square values obtained for top scan of letter S                     20
Table 2.3    R Square values obtained for right scan of letter S                   21
Table 2.4    R Square values obtained for down scan of letter S                    22
Table 3.1    Comparison of recognition accuracy of handwritten English
             characters                                                            23
Table 3.2    Letters conflicting in approach 1                                     24

CHAPTER 1 INTRODUCTION

1. INTRODUCTION
Optical character recognition (OCR) aims to enable computers to recognize optical characters without human intervention. In the early 1950s, David Shepard developed the first such machine, called Gismo, to convert printed material into machine language. Shepard then founded Intelligent Machines Research Corporation (IMR), which produced the first OCR systems for commercial operation.

OCR (optical character recognition) refers to the branch of computer science that involves reading text from paper and translating the images into a form that the computer can manipulate (for example, into ASCII codes). An OCR system enables you to take a book or a magazine article, feed it directly into an electronic computer file, and then edit the file using a word processor.


1.1 TYPES OF CHARACTER RECOGNITION SYSTEM


The constant development of computer tools leads to a requirement for easier interfaces between man and computer. Character recognition is one way of achieving this. Character recognition deals with the problem of reading offline characters, i.e. at some point in time (minutes, seconds or hours) after they have been written. Recognition of unconstrained handwritten text can be very difficult, however, because characters cannot be reliably isolated, especially when the text is cursive handwriting. Recognition systems are classified into the following two types, as shown in the figure on the next page:

Figure 1.1: Types of character recognition system

1.1.1 ONLINE CHARACTER RECOGNITION

In online character recognition, characters are recognized in real time. Online systems have better information for recognition, since they have timing information and since they avoid the initial search step of locating the character required by their offline counterparts. Online systems obtain the position of the pen as a function of time directly from the interface. Offline recognition, by contrast, is known as a challenging problem because of the complex character shapes and the great variation of character symbols written in different modes.

1.1.2 OFFLINE CHARACTER RECOGNITION

In offline character recognition, the typewritten/handwritten character is typically scanned from a paper document and made available to the recognition algorithm as a binary or gray-scale image. Offline character recognition is a more challenging and difficult task, since there is no control over the medium and instrument used. The artifacts of the complex interaction between instrument and medium, and of subsequent operations such as scanning and binarization, present additional challenges to an offline recognition algorithm. Offline character recognition is therefore considered a more challenging task than its online counterpart. After an image scanner optically captures the text image, the image is handed to the recognition algorithm, which carries out the steps of character recognition described below.

The major difference between online and offline character recognition is that online character recognition has real-time contextual information while offline data does not. This difference leads to a significant divergence in processing architectures and methods.

Definition:
  On-line  - In online recognition systems, the input is an image of handprinted text, usually acquired from a tablet computer or a pen-based device such as a cell phone or sign pad.
  Off-line - In offline recognition, the image of typed or handwritten text is acquired through scanning with an optical scanner. The image is then read by the system and analyzed for recognition.
Availability of no. of pen-strokes:
  On-line  - Yes
  Off-line - No
Raw data requirement:
  On-line  - # samples/second (e.g. 100)
  Off-line - # dots/inch (e.g. 300)
Way of writing:
  On-line  - Using a digital pen on an LCD surface
  Off-line - Paper document

Table 1.1 Comparison between online and offline handwritten characters

1.2 DIFFERENT PHASES OF CHARACTER RECOGNITION


The process of printed-text recognition of English characters can be divided into phases as shown in Figure 1.2. Each phase is explained below:

Figure 1.2 Block diagram of printed text recognition

1.3 PRE-PROCESSING
Pre-processing is the name given to a family of procedures for smoothing, enhancing, filtering, cleaning up and otherwise massaging a digital image so that subsequent algorithms along the road to final classification can be made simpler and more accurate. The scanned image always contains noise, which usually appears as extra pixels (black or white) in the character image. If this noise is not taken into consideration, it can subvert the process and produce an incorrect result. The various pre-processing methods are explained below:

1.3.1 BINARIZATION

Document image binarization (thresholding) refers to the conversion of a gray-scale image into a binary image. There are two categories of thresholding:
Global: picks one threshold value for the entire document image, often based on an estimation of the background level from the intensity histogram of the image.
Adaptive: uses a different value for each pixel according to local area information.

1.3.2 SEGMENTATION

It is an operation that seeks to decompose an image of a sequence of characters into sub-images of individual symbols. Character segmentation is a key requirement that determines the utility of conventional character recognition systems. It includes line, word and character segmentation.

The different methods used can be classified based on the type of text and the strategy followed, such as recognition-based segmentation and the cut classification method. After scanning the document, the document image is subjected to pre-processing for background noise elimination and skew correction to generate the bitmap image of the text. The preprocessed image is then segmented into lines, words and characters.
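The line-segmentation step described above can be sketched with a horizontal projection: rows containing ink belong to a text line, blank rows separate lines. This is a pure-Python illustration of the idea, not the report's Matlab code; `segment_lines` is our own name.

```python
def segment_lines(img):
    """Split a binary image (list of rows, 1 = black text pixel) into
    horizontal bands, one (start_row, end_row) pair per text line,
    using a horizontal projection of the ink."""
    bands, start = [], None
    for r, row in enumerate(img):
        ink = any(row)
        if ink and start is None:
            start = r                 # a new text line begins
        elif not ink and start is not None:
            bands.append((start, r))  # the line ended on the blank row
            start = None
    if start is not None:             # image ends inside a text line
        bands.append((start, len(img)))
    return bands

img = [
    [0, 0],
    [1, 1],
    [1, 0],
    [0, 0],
    [0, 1],
]
print(segment_lines(img))  # [(1, 3), (4, 5)]
```

The same projection applied to columns within a band gives word and character segmentation.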

1.3.3 FEATURE EXTRACTION AND CLASSIFICATION

A character recognition system consists of two stages: feature extraction and classification. Feature extraction is the name given to a family of procedures for measuring the relevant shape information contained in a pattern so that the task of classifying the pattern is made easy by a formal procedure.

The feature extraction stage analyses a text segment and selects a set of features that can be used to uniquely identify the text segment. The selection of a stable and representative set of features is the heart of pattern recognition system design.

Classification stage is the main decision making stage of the system and uses the features extracted in the previous stage to identify the text segment according to preset rules. Classification is concerned with making decisions concerning the class membership of a pattern in question. The task in any given situation is to design a decision rule that is easy to compute and will minimize the probability of misclassification relative to the power of feature extraction scheme employed. Patterns are thus transformed by feature extraction process into points in d dimensional feature space. A pattern class can then be represented by a region or sub-space of the feature space. Classification then becomes a problem of determining the region of feature space in which an unknown pattern falls.

1.4 BACKGROUND
Graphic Files

A graphic file is a file containing a picture, which may be a line drawing or a scanned photograph. Any program that displays or manipulates stored images needs to be able to store an image for later use. Data in graphic files can be encoded in two different ways.

ASCII Text

This is readable text which is easy for humans to read and, to some extent, to edit, and easy for programs to read and write. But it is bulky and slow for programs to read and write.

Compressed Formats (Binary Formats)

These are very compact but incomprehensible to humans, and they require complex reading and writing routines. They vary a lot in the flexibility they offer for image size, shape, colors and their attributes. At one end is TIFF (Tagged Image File Format), with so many different options and features that no single TIFF implementation can read them all; at the other end is Mac Paint, which allows storing the image in exactly one size, two colors and one way. Graphic files are further classified into two types according to the manner in which they store the image.

Bitmapped Format

Here the picture is represented as a rectangular array of dots; the format stores the complete, digitally encoded image. These are also called raster or dot-matrix descriptions. They are used when images are, in large part, created by hand or scanned from an original document or photograph using some type of scanner. A few bitmapped graphic file formats are:
TIFF (Tagged Image File Format)
GIF (Graphics Interchange Format)
BMP (Bitmap Format)
Mac Paint
IMG
TGA (Targa)
JPEG (Joint Photographic Experts Group)

Vector Formats

These represent a picture as a series of lines and arcs, i.e. they store the individual graphics that make up the image. Such images are also called line images. As most of the lines needed can be represented by relatively simple mathematical equations, the images can be stored economically. For example, to specify a straight line all that is needed is the positions of its two end points, and for display purposes the line can then be reconstructed from its geometrical properties. Similarly, to draw a circle all that is needed is its centre and radius. The advantages of vector formats are:
They require less storage space.

Their quality is not affected when the images are magnified, in contrast to pixel images.

Pixel

A pixel (picture element) is a dot, the most fundamental unit that makes up an image. Every pixel has a value associated with it, called the pixel value, representing the color at that point. For the simplest pictures each point is black or white, so the pixel value is either 0 or 1, a single bit. More commonly, however, the picture is in grayscale or color, in which case there has to be a larger range of pixel values. For a grayscale image each pixel might be 8 bits, so the value can range from 0 for black to 255 for white.

True Color

24-bit color represents the limit of the human eye's ability to differentiate colors. To the human eye there is no perceptible difference between a 24-bit color image of an object and the object viewed directly; hence it is referred to as true color.

Palette / Color Map

Full-color images can be very large. A 600 x 800 image contains 4,80,000 pixels; if each pixel were stored as a 24-bit value, the image would consume about 1.4 MB. To decrease the amount of space needed to store the image, the concept of a color map, or palette, is used. Rather than storing the actual color of each pixel in the file, the color map contains a list of all colors used in the image, and the individual pixel values are stored as entry numbers in the color map/palette. A typical color map has 16 or 256 entries, so each pixel value is only 4 or 8 bits, an enormous saving over 24 bits per pixel. Programs can also create various screen effects by changing the color map. The advantages of using a color map are:
The amount of RAM and memory needed to store the image is considerably reduced.
The image definition is virtualized. The value of the latter can be seen by considering the task of changing one color in the image: instead of changing all pixels of that color, we need to change only the palette entry for that color.
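The storage saving can be checked with a quick calculation. This is a sketch with a helper name of our own (`image_bytes`); it counts raw pixel data plus an optional palette of 3-byte entries, ignoring file headers and row padding.

```python
def image_bytes(width, height, bits_per_pixel, palette_entries=0):
    """Approximate storage for a raw image: pixel data plus an optional
    palette of 24-bit (3-byte) color entries. Headers are ignored."""
    pixel_bytes = width * height * bits_per_pixel // 8
    return pixel_bytes + palette_entries * 3

# 600 x 800 true-color image: 24 bits per pixel, no palette.
print(image_bytes(800, 600, 24))                      # 1440000 (about 1.4 MB)
# Same image with a 256-entry palette: 8 bits per pixel plus the palette.
print(image_bytes(800, 600, 8, palette_entries=256))  # 480768
```

The paletted version is roughly a third of the true-color size, which is the saving the text describes.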

Color Model
A color model is a formal way for representing and defining colors. A synonymous term is photometric interpretation. There are different types of color models.

Resolution
Graphic images on the screen are made up of tiny dots called pixels, or picture elements. The display resolution is defined by the number of rows (called scan lines) from top to bottom and the number of pixels from left to right on each scan line. Each display mode uses a particular resolution; the higher the resolution, the more pleasing the picture. Higher resolution means a sharper, clearer picture, with a less pronounced staircase effect on lines drawn diagonally and better-looking text characters. Higher resolution also requires more memory to display the picture.

Windows Bitmap Format (BMP)

The Windows BMP format is a general-purpose format for storing Device Independent Bitmaps (DIBs). By DIB we mean that the physical interpretation of the image and its palette is fixed without regard to the requirements of any potential display device. It is most often used to store screen- and scanner-generated imagery.

The BMP format only supports single-plane bitmaps of 1, 4, 8 or 24 bits per pixel. One annoying aspect of BMP is that the image is stored by scan line proceeding from the bottom row to the top; all other formats use the reverse order, or at least support top-to-bottom order as an option, and top-to-bottom is a de facto standard. BMP breaks the file into four separate components:
A file header.
An image header.
An array of palette entries.
The actual bitmap data.

When dealing with BMP it is recommended to use a palette unless we are dealing with a 24-bit image. BMP supports image compression by RLE (run-length encoding); only images with 4-bit and 8-bit pixel sizes can be encoded, and the interpretation of the encoded image data depends slightly on which pixel size is present. Scan lines in a BMP file are padded with unused bits at the end so that their length is an integral number of double words, i.e. the number of bytes is evenly divisible by 4. Despite the fact that the format supports compression, it is rare to find an application that actually bothers to encode image data in this format; thus only a few BMP files are compressed.
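The double-word scan-line padding described above can be computed directly. This is a small sketch; `bmp_row_size` is our own helper name, not part of any BMP library.

```python
def bmp_row_size(width, bits_per_pixel):
    """Bytes per BMP scan line, padded so the length is a whole number
    of double words (i.e. a multiple of 4 bytes)."""
    raw_bits = width * bits_per_pixel
    # Round the bit count up to a multiple of 32 bits, then convert to bytes.
    return ((raw_bits + 31) // 32) * 4

# A 10-pixel-wide, 24-bit row needs 30 bytes of pixel data,
# padded up to 32 bytes so the length is divisible by 4.
print(bmp_row_size(10, 24))  # 32
```

Multiplying this row size by the image height gives the size of the (uncompressed) bitmap data component.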

CHAPTER 2 METHODOLOGY

2. METHODOLOGY
The tool used for our project is MATLAB. The name MATLAB stands for matrix laboratory. MATLAB is a high-performance language for technical computing. It integrates computation, visualization and programming in an easy-to-use environment where problems and solutions are expressed in familiar mathematical notation. Typical uses include:
Math and computation
Algorithm development
Modeling, simulation and prototyping
Data analysis, exploration and visualization
Scientific and engineering graphics
Application development, including graphical user interface building

MATLAB is an interactive system whose basic data element is an array that does not require dimensioning. This allows many technical computing problems, especially those with matrix and vector formulations, to be solved in a fraction of the time it would take to write a program in a scalar non-interactive language such as C or Fortran.

MATLAB has evolved over a period of years with input from many users. In university environments it is the standard instructional tool for introductory and advanced courses in mathematics, engineering and science. In industry, MATLAB is the tool of choice for high-productivity research, development and analysis. The reason we decided to use MATLAB for the development of this project is its toolboxes: toolboxes allow you to learn and apply specialized technology, and they are comprehensive.

2.1 PREPROCESSING
Pre-processing is the name given to a family of procedures for smoothing, enhancing, filtering, cleaning up and otherwise massaging a digital image so that subsequent algorithms along the road to final classification can be made simpler and more accurate. The various pre-processing methods are explained below:


2.1.1 BINARIZATION

Document image binarization (thresholding) refers to the conversion of a gray-scale image into a binary image. There are two categories of thresholding:
Global: picks one threshold value for the entire document image, often based on an estimation of the background level from the intensity histogram of the image.
Adaptive (local): uses a different value for each pixel according to local area information.
Here, the gray-scale image is converted to a binary image based on the luminosity of each pixel:
Luma = Red * 0.3 + Green * 0.59 + Blue * 0.11
If Luma > 127 the pixel is made white, else black.
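The luma rule above can be sketched in a few lines. This is a Python illustration of the step (the project itself uses Matlab); the image is assumed to be a list of rows of (R, G, B) tuples.

```python
def binarize(rgb_image, threshold=127):
    """Binarize an RGB image using the luma of each pixel:
    luma = 0.3*R + 0.59*G + 0.11*B; above the threshold -> white (1),
    otherwise -> black (0)."""
    out = []
    for row in rgb_image:
        out_row = []
        for r, g, b in row:
            luma = 0.3 * r + 0.59 * g + 0.11 * b
            out_row.append(1 if luma > threshold else 0)
        out.append(out_row)
    return out

# A white pixel, a black pixel and a mid-gray pixel in one row.
print(binarize([[(255, 255, 255), (0, 0, 0), (128, 128, 128)]]))
# [[1, 0, 1]]
```

Note that the mid-gray pixel (luma 128) falls just above the 127 threshold and becomes white.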

2.1.2 BOUNDING OF CHARACTERS

Here, a bounding box is created for every character. The advantage of creating bounding boxes is that the area of a particular character can be calculated easily. There is no limitation on the number of characters: any number of characters included in the given image can be boxed.

Figure 2.1 Bounding of characters
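A bounding box for one segmented character can be sketched as follows. This is a pure-Python illustration (the report's Matlab code is not shown); the character image is a binary grid where 1 marks an ink pixel.

```python
def bounding_box(char_img):
    """Return (top, left, bottom, right) of the smallest box enclosing
    all ink pixels (value 1) in a binary character image."""
    rows = [r for r, row in enumerate(char_img) if any(row)]
    cols = [c for row in char_img for c, v in enumerate(row) if v]
    return min(rows), min(cols), max(rows), max(cols)

img = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
print(bounding_box(img))  # (1, 1, 2, 2)
```

From the box corners, the character's area is simply (bottom - top + 1) * (right - left + 1), which is the easy area calculation the text mentions.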

2.2 SEGMENTATION
It is an operation that seeks to decompose an image of a sequence of characters into sub-images of individual symbols. Character segmentation is a key requirement that determines the utility of conventional character recognition systems. It includes line, word and character segmentation.


The different methods used can be classified based on the type of text and the strategy followed, such as recognition-based segmentation and the cut classification method. After scanning the document, the document image is subjected to pre-processing for background noise elimination and skew correction to generate the bitmap image of the text. The preprocessed image is then segmented into lines, words and characters.

2.2.1 ALGORITHM

The image is scanned separately from all four sides. After this, two approaches were used, as follows:

Approach 1

When we scan from a particular side, we store only the first, the last and the middle black pixel. These three points are used to determine the type of curve, which may be a left parabola, right parabola, upper parabola, down parabola, vertical line, slant line or pair of straight lines. This information is stored in an array for every character and is used at a later stage to determine the character.
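For one scan direction this can be sketched as follows: build the boundary profile for that side, then keep only its first, middle and last points (the red points of Figure 2.7). This is a Python illustration under our own naming (`left_profile_keypoints`); the report's Matlab code may organize it differently.

```python
def left_profile_keypoints(char_img):
    """Left-scan a binary character image: record the first black pixel
    in each row, then keep only the first, middle and last of those
    boundary points for the curve-type analysis."""
    profile = []
    for r, row in enumerate(char_img):
        for c, v in enumerate(row):
            if v:                      # first ink pixel from the left
                profile.append((r, c))
                break
    if not profile:
        return []
    return [profile[0], profile[len(profile) // 2], profile[-1]]

img = [
    [0, 1, 0],
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
]
print(left_profile_keypoints(img))  # [(0, 1), (2, 0), (4, 1)]
```

The analogous right, top and down scans give three more point triples, one per side.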

Approach 2

Every point obtained in each side scan is considered here to determine the particular fit.

Figure 2.2  The image file of letter S
Figure 2.3  Left scan of figure 2.2
Figure 2.4  Top scan of figure 2.2
Figure 2.5  Right scan of figure 2.2
Figure 2.6  Down scan of figure 2.2

2.3 FEATURE EXTRACTION


The feature extraction stage analyses a text segment and selects a set of features that can be used to uniquely identify the text segment. For feature extraction two approaches were used.

2.3.1 APPROACH 1: COORDINATE ANALYSIS

For example, the following characters can be segmented into different curves as follows:-

Figure 2.7 Red points are considered while analysis instead of all the points obtained in each scan

The information from all four sides is stored in the info array: one row for each side, with the columns storing information as follows:


The first column stores whether the shape is a parabola or a line. The second column stores which type of parabola or line it is: a parabola can be left, right, up or down; a line can be slant, tilted or vertical.

This information is analyzed, and conditions for each letter are then applied to identify a particular letter.

2.3.2 APPROACH 2: CURVE FITTING

In this approach, the data obtained from the four side scans is individually fitted using interpolation. From the goodness of fit, the best fit is found and compared with that in the database.

Goodness of Fit: The goodness of fit of a statistical model describes how well it fits a set of observations. Measures of goodness of fit typically summarize the discrepancy between observed values and the values expected under the model in question.

QR Decomposition

In linear algebra, a QR decomposition (also called a QR factorization) of a matrix is a decomposition of the matrix into the product of an orthogonal matrix and an upper triangular matrix. QR decomposition is often used to solve the linear least squares problem, and is the basis for a particular eigenvalue algorithm, the QR algorithm.

About R2

A data set has values yi, each of which has an associated modeled value fi (also sometimes written ŷi). The values yi are called the observed values and the modeled values fi are sometimes called the predicted values. The "variability" of the data set is measured through different sums of squares:

SStot = Σi (yi − ȳ)², the total sum of squares (proportional to the sample variance);

SSreg = Σi (fi − ȳ)², the regression sum of squares, also called the explained sum of squares;

SSerr = Σi (yi − fi)², the sum of squares of residuals, also called the residual sum of squares.

In the above, ȳ is the mean of the observed data:

ȳ = (1/n) Σi yi

where n is the number of observations. The notations SSR and SSE should be avoided, since in some texts their meaning is reversed to the residual sum of squares and the explained sum of squares, respectively. The most general definition of the coefficient of determination is

R2 = 1 − SSerr / SStot

Relation to unexplained variance

In this general form, R2 can be seen to be related to the unexplained variance, since the second term compares the unexplained variance (the variance of the model's errors) with the total variance (of the data).

Explained variance

In some cases the total sum of squares equals the sum of the two other sums of squares defined above:

SSreg + SSerr = SStot

See sum of squares for a derivation of this result for one case where the relation holds. When this relation does hold, the above definition of R2 is equivalent to

R2 = SSreg / SStot

In this form R2 is given directly in terms of the explained variance: it compares the explained variance (the variance of the model's predictions) with the total variance (of the data).
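The first definition of R2 translates directly into code. This is a small Python sketch of the formula R2 = 1 − SSerr/SStot; `r_squared` is our own name.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SSerr/SStot, where SSerr is the
    residual sum of squares and SStot the total sum of squares."""
    n = len(observed)
    mean = sum(observed) / n
    ss_tot = sum((y - mean) ** 2 for y in observed)          # SStot
    ss_err = sum((y - f) ** 2 for y, f in zip(observed, predicted))  # SSerr
    return 1 - ss_err / ss_tot

# A perfect fit gives R2 = 1.0.
print(r_squared([1, 2, 3], [1, 2, 3]))  # 1.0
```

Predicting the mean for every point (here, [2, 2, 2]) makes SSerr equal to SStot, so R2 = 0, which matches the interpretation of R2 as explained variance.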


Interpretation

R2 is a statistic that gives some information about the goodness of fit of a model. In regression, the R2 coefficient of determination is a statistical measure of how well the regression line approximates the real data points; an R2 of 1.0 indicates that the regression line perfectly fits the data.

Values of R2 outside the range 0 to 1 can occur when R2 is used to measure the agreement between observed and modeled values and the "modeled" values are not obtained by linear regression, depending on which formulation of R2 is used. If the first formula above is used, values can never be greater than one; if the second expression is used, there are no constraints on the values obtainable.

In many (but not all) instances where R2 is used, the predictors are calculated by ordinary least-squares regression, that is, by minimizing SSerr. In this case R2 increases as we increase the number of variables in the model (R2 will not decrease). This illustrates a drawback of one possible use of R2, where one might keep including more variables in the model until "there is no more improvement". This leads to the alternative approach of looking at the adjusted R2, whose interpretation is almost the same as that of R2 but which penalizes the statistic as extra variables are included in the model.

For cases other than fitting by ordinary least squares, the R2 statistic can be calculated as above and may still be a useful measure. If fitting is by weighted least squares or generalized least squares, alternative versions of R2 can be calculated appropriate to those statistical frameworks, while the "raw" R2 may still be useful if it is more easily interpreted. Values of R2 can be calculated for any type of predictive model, which need not have a statistical basis.

Segmentation approach

Here we scan the letter from all four sides (i.e. left, right, top and bottom). The first black pixel from every column is stored in an input array. This input array is given as input to the curve fitting tool of Matlab.
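The top-scan step (first black pixel in each column) can be sketched as below. This is a Python illustration; the Matlab code presumably does the equivalent with matrix indexing, and `top_scan` is our own name.

```python
def top_scan(char_img):
    """For each column of a binary character image, return (x, y) where y
    is the row of the first black pixel seen from the top (None if the
    column contains no ink). These points are the input to curve fitting."""
    n_cols = len(char_img[0])
    points = []
    for x in range(n_cols):
        y = next((r for r in range(len(char_img)) if char_img[r][x]), None)
        points.append((x, y))
    return points

img = [
    [0, 1, 0],
    [1, 1, 0],
]
print(top_scan(img))  # [(0, 1), (1, 0), (2, None)]
```

The left, right and down scans produce analogous point sets, one per side, each of which is fitted separately.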


Figure 2.8 Segmentation of character A left scan

The curve fitting tool of Matlab then fits the input points with polynomial equations from 1st to 7th degree. The input to the curve fitting is given in two forms, one as x against y and the other as y against x; whichever of these fits well as a function is used to decide which curve it is.

Output of Curve Fitting:

The tool gives the coefficients of the fitted equation for every degree of polynomial (1st to 7th). It also provides goodness-of-fit measures such as R2.

Figure 2.9 Curve fit tool in Matlab


Analysis

The closer the value of R2 to 1, the better the fit. If the value of R2 is above 0.9 for the first-degree polynomial, the curve is a line, and from the equation of the line we can decide its orientation. Similarly, if R2 > 0.9 for the second-degree polynomial, it is a parabola, and from the sign of the coefficient of the squared term we can tell whether the parabola faces top, bottom, left or right. The same method is followed for higher-degree polynomials. This data is used to differentiate the English alphabets.
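The rule of thumb above can be written out as a small classifier over precomputed fit results. This is a sketch: the 0.9 threshold comes from the text, while the function name, argument names and result labels are our own.

```python
def classify_curve(r2_linear, r2_quadratic, quad_coeff):
    """Classify a scanned boundary as a line or a parabola from the
    R-square of the degree-1 and degree-2 fits. For a parabola, the sign
    of the squared-term coefficient gives the opening direction."""
    if r2_linear > 0.9:
        return "line"
    if r2_quadratic > 0.9:
        # Positive coefficient: opens upward in the fitted coordinates;
        # negative coefficient: opens downward.
        return "parabola (opens up)" if quad_coeff > 0 else "parabola (opens down)"
    return "unknown"

print(classify_curve(0.02, 0.93, -1.5))  # parabola (opens down)
```

Whether "up" in the fitted coordinates means a top-, bottom-, left- or right-facing parabola depends on whether the (x, y) or (y, x) form of the data was fitted, as described for Figures 2.10 and 2.11.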
Sample 1: For figures 2.2 and 2.3

It is the left side scan. For this, the curve fit analysis is:

Polynomial       R Square
Linear           0.00701
Quadratic        0.01744
Cubic            0.006231
4th degree       0.4077
5th degree       0.4886
6th degree       0.5251
7th degree       0.6264
8th degree       0.6398

Table 2.1 R Square values obtained for left scan of letter S


For figure 2.4

It is the top side scan. For this, the curve fit analysis is:

Figure 2.10 The coordinates, if considered as x and y i.e. (xy), do not fit

Figure 2.11 The coordinates, if considered as y and x i.e. (yx), do fit


Figure 2.12 Fit obtained for coordinates for top scan


Polynomial       R Square
Linear           0.004026
Quadratic        0.6978
Cubic            0.7008
4th degree       0.8945
5th degree       0.8945
6th degree       0.9353
7th degree       0.9392
8th degree       0.9392

Table 2.2 R Square values obtained for top scan of letter S

For figure 2.4

It is the right side scan. For this, the curve fit analysis is:


Polynomial       R Square
Linear           0.07194
Quadratic        0.07564
Cubic            0.2402
4th degree       0.408
5th degree       0.6009
6th degree       0.6063
7th degree       0.717
8th degree       0.7448

Table 2.3 R Square values obtained for right scan of letter S

For figure 2.5

It is the down side scan. For this, the curve fit analysis is:

Figure 2.13 Curve fitting for top scan in 2nd degree


Polynomial       R Square
Linear           0.02188
Quadratic        0.9297
Cubic            0.9321
4th degree       0.9867
5th degree       0.9872
6th degree       0.9939
7th degree       0.994
8th degree       0.9952

Table 2.4 R Square values obtained for down scan of letter S

Conclusion from analysis: For the letter S, the following combination of values was found to be unique to S. From the tables it can be seen that:

- The 7th degree (xy) R Square of the left side scan is greater than 0.6.
- The 7th degree (xy) R Square of the right side scan is greater than 0.65.
- The 2nd degree (yx) R Square of the top side scan is greater than 0.65.
- The 2nd degree (yx) R Square of the down side scan is greater than 0.85.

These conditions are checked against the current letter; if they are satisfied, the letter S is recognized. Similarly, the R Square values of all 26 letters were analysed and features were found which are used to recognize the letters.
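The per-letter decision rule described above amounts to a simple threshold check. A hypothetical Python sketch, where `r2[side][degree]` is assumed to hold the R-square of that side's fit (the fourth condition is read as the down scan, since its Table 2.4 value, 0.9297, is the one that clears 0.85):

```python
def looks_like_S(r2):
    """Threshold check for the letter S: r2 maps a scan side to a dict
    of {polynomial degree: R-square}. Left/right scans use (xy) fits,
    top/down scans use (yx) fits."""
    return (r2["left"][7] > 0.60 and
            r2["right"][7] > 0.65 and
            r2["top"][2] > 0.65 and
            r2["down"][2] > 0.85)

# The values reported in Tables 2.1-2.4 for the letter S satisfy the rule
s_scans = {"left": {7: 0.6264}, "right": {7: 0.717},
           "top": {2: 0.6978}, "down": {2: 0.9297}}
print(looks_like_S(s_scans))  # True
```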


CHAPTER 3 IMPLEMENTATION

3. IMPLEMENTATION
3.1 RESULT
Typewritten English character sets and numerals (0-9) are taken. The following steps are followed to obtain the best accuracy for an input printed English capital character image. First of all, the system is trained using different data sets or samples. The system is then tested on a few of the given samples and its accuracy is measured. The table given below displays the results obtained from the program. The variance is very small, but it is there. The following are the main results of English capital character recognition:

Feature                   | Approach 1 (Coordinates analysis)                  | Approach 2 (Curve fitting)
Method used               | Coordinates analysis                               | Curve fitting
No. of letters recognized | 19 (N, J, S, Q, G, B, Y not recognized)            | 26 (A-Z)
Characters recognized     | English capital                                    | English capital
Size                      | Can vary                                           | Can vary, but limited at small sizes for some letters like O, Q
Font                      | Works for some similar fonts like Arial, Calibri   | Works for Arial and related fonts, but needed more analysis
Memory required           | Less                                               | More

Table 3.1 Comparison of recognition accuracy of printed English characters


Letters that can be recognized: H, U, O, D, V
Other conflicting letters (which cannot be recognized): N, J, S, Q, G, B, Y

Table 3.2 Letters conflicting in approach 1

Letters conflicting in approach 2: O and Q; H and N when the size is small.

3.2 CONCLUSION
Recognition approaches depend heavily on the nature of the data to be recognized. Typewritten English characters can be recognized by various techniques, but the recognition process needs to be much more efficient and accurate to recognize characters written by different users. A few factors create problems in English character recognition. Some characters are similar in shape (for example H and N, O and Q). Different users, or even the same user, can write differently at different times, depending on the pen or pencil, the width of the line, a slight rotation of the paper, the type of paper, and the mood and stress level of the person. The character can be written at different locations on the paper or in the window. Characters can also be written in different fonts.


3.3 SNAPSHOTS
Step 1: Preprocessing

The input image is to be converted into machine-encoded text.

Figure 3.1 Typewritten document of English capital letters

The image scanned using a scanner is shown in the following figure. The scanned image contains noise, which causes a jagged appearance in the image.

Figure 3.2 The image is selected
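Binarization at this stage can be as simple as a global threshold. A minimal sketch (the function name is ours), assuming a grayscale image whose ink is darker than the threshold; production systems usually pick the threshold automatically, for example with Otsu's method:

```python
import numpy as np

def binarize(gray, threshold=128):
    """Map a grayscale image to binary: 1 = ink (darker than the
    threshold), 0 = background."""
    return (np.asarray(gray) < threshold).astype(np.uint8)

print(binarize([[0, 200], [100, 255]]).tolist())  # [[1, 0], [1, 0]]
```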


Step 2: Segmentation

The next step after the image is selected is line segmentation. Each line of the scanned image is identified using line segmentation.

Figure 3.3 Line segmentation

After identifying each line of the image, the next step is to recognize each character. This is done using character segmentation. The following image shows how the word YEAR is segmented into characters.

Figure 3.4 Character segmentation
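Line and character segmentation as described can be sketched with projection profiles. This is a Python approximation of the idea, not the report's actual MATLAB code, and it assumes clean printed text with white gaps between lines and between letters:

```python
import numpy as np

def segment_runs(profile):
    """Return (start, end) pairs of consecutive nonzero entries in a
    projection profile; each run is one text line (or one character)."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs

def segment_lines_and_chars(binary_img):
    """Lines from the horizontal projection (ink per row), characters
    within each line from that line's vertical projection (ink per
    column)."""
    img = np.asarray(binary_img)
    lines = segment_runs(img.sum(axis=1))
    return [(r0, r1, segment_runs(img[r0:r1].sum(axis=0)))
            for r0, r1 in lines]

# One text line containing two one-column 'characters'
page = [[1, 0, 1],
        [1, 0, 1]]
print(segment_lines_and_chars(page))  # [(0, 2, [(0, 1), (2, 3)])]
```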

Lastly, the image is converted into editable text document.

Figure 3.5 The recognized characters in the text file


CHAPTER 4 APPLICATIONS OF CHARACTER RECOGNITION SYSTEM

4. APPLICATIONS OF CHARACTER RECOGNITION SYSTEM


There are a number of applications of a Character Recognition System:

Task-specific Readers

Task-specific readers are used primarily for high-volume applications which require high system throughput. Since high throughput rates are desired, handling only the fields of interest helps reduce the time constraint. Since similar documents possess similar size and layout structure, it is straightforward for the image scanner to focus on those fields where the desired information lies. This approach can considerably reduce image processing and text recognition time. Some application areas to which task-specific readers have been applied include:

- Assigning ZIP codes to letter mail
- Reading data entered in forms, e.g. tax forms
- Verification of account numbers and courtesy amounts on bank cheques
- Automatic accounting procedures used in processing utility bills
- Automatic accounting of airline passenger tickets
- Automatic validation of passports

Address Readers

The address reader in a postal mail sorter locates the destination address block on a mail piece and reads the ZIP code in this address block. If additional fields in the address block are read with high confidence, the system may generate a 9-digit ZIP code for the piece. The resulting ZIP code is used to generate a bar code which is sprayed on the envelope. The Multiline Optical Character Reader (MLOCR) used by the United States Postal Service (USPS) locates the address block on a mail piece, reads the whole address, identifies the ZIP+4 code, generates a 9-digit bar code and sorts the mail to the correct stacker. The character classifier recognizes up to 400 fonts and the system can process up to 45,000 mail pieces per hour.

Form Reader

A form reading system needs to discriminate between pre-printed form instructions and filled-in data. The system is first trained with a blank form, registering those areas on the form where the data should be printed. During the form recognition phase, the system uses the spatial information obtained from training to scan the regions that should be filled with data. Some readers read hand-printed data as well as various machine-written texts, and can read data on a form without being confused by the form instructions. Some systems can process forms at a rate of 5,800 forms per hour.

Cheque Reader

A cheque reader captures the cheque image, recognizes the courtesy amount and account information on the cheque, and uses the information in both fields to cross-check the recognition result. An operator can correct misclassified characters by cross-validating the recognition results with the cheque image that appears on a system console.

Bill Processing System

In general, a bill processing system is used to read payment slips, utility bills and inventory documents. The system focuses on certain regions of a document where the expected information is located, e.g. the account number and payment value.

Passport Readers

An automated passport reader is used to speed returning American passengers through customs inspection. The reader reads a traveler's name, date of birth and passport number from the passport and checks these against database records that contain information on fugitive felons and smugglers.

General Purpose Page Readers

There are two general categories of page reader: high-end page readers and low-end page readers. High-end page readers offer more advanced recognition capability and higher data throughput than low-end page readers. A low-end page reader usually does not come with a scanner but is compatible with many flat-bed scanners. They are mostly used in an office environment with desktop workstations, which are less demanding in system throughput. Since they are designed to handle a broader range of documents, a sacrifice of recognition accuracy has to be made. Some commercial OCR software allows users to adapt the recognition engine to customer data to improve recognition accuracy.


BIBLIOGRAPHY
ONLINE RESOURCES:
http://en.wikipedia.org/wiki/Natural_language_processing#Major_tasks_in_NLP

http://www.codeproject.com/KB/recipes/UnicodeOCR.aspx

(users.info.unicaen.fr/~szmurlo/papers/masters/master.thesis.ps.gz)

http://www.freetechebooks.com/doc-2011/ocr-with-matlab.html

dspace.thapar.edu:8080/dspace/bitstream/10266/789/.../full+final+thesis.pdf

