
The Eighth International Conference on Electronic Measurement and Instruments ICEMI2007

Offline Handwritten Numeral Recognition Based on Principal Component Analysis

Wan Junli¹, Huang Yuehua¹, Zhang Guohua¹, Wan Cheng²

(1. China Three Gorges University, Yichang 443002, China)
(2. Wuhan University, Wuhan 430072, China)

Abstract: To overcome the difficulty of fusing statistical and structural
features in handwritten numeral recognition, Principal Component Analysis is
used to reconstruct the numeral model and to estimate the numeral
reconstructive error based on the statistical information of the digit
structural feature. At the same time, the height-width ratio and the Euler
value of the numeral are extracted. Recognition of the digit character is
completed by combining the neural network and Bayes classifiers that
correspond to these three types of features. The recognition rate of this
method is 90.73% on a handwritten numeral database.

Keywords: Handwritten Numeral Recognition; Principal Component; Feature
Extracting; Combining Classifiers

1-4244-1135-1/07/$25.00 ©2007 IEEE.

1 Introduction

Handwritten numeral recognition has long been a research focus in the fields
of image processing and pattern recognition. The variety of numerals in size,
shape, slant and writing style makes the research harder [1]. There are two
ways to identify handwritten characters, distinguished by the traits they
use: one is based on the structural features of the numerals, and the other
is based on their statistical features. Principal Component Analysis (PCA) is
an effective feature extraction method that can combine structural and
statistical features [2]. Based on PCA, combined with the height-width ratio
feature and the Euler value feature of numeral characters, this paper studies
a new method to improve the recognition rate of offline handwritten numerals.

2 Handwritten Numeral Image Preprocessing

A handwritten numeral is a special graph image, and it is the graph that is
valuable to recognition. Handwritten numeral image preprocessing therefore
extracts the graph from the image, namely it converts the image into a graph.
Preprocessing generally contains binarization, numeral string segmentation,
character slant-correction, normalization and so on [3,4]. The numeral stroke
thickness itself is not important: the significant feature for recognition is
the stroke style and configuration in the graph. However, the disparity of
stroke thickness can still affect the performance of the recognition system,
so normalization of stroke thickness is usually included in the
preprocessing. It is based on Mathematical Morphology and can be realized by
first skeletonizing the strokes and then dilating them.

3 Handwritten Numeral Feature Extraction

The fundamental idea of PCA is to seek the optimal subspace in which the
projected components of the high-dimensional data $x$ have the maximum
variance. At the same time, when the new components are used to reconstruct
the original data, the mean-squared error is minimal.

The preprocessed handwritten numeral in this paper is a $16 \times 16$ image,
which is converted into the vector $x^i = [x_1, x_2, \ldots, x_{256}]^T$,
$i = 0, 1, \ldots, 9$, where $\omega_i$ denotes the class of the sample.
The mean $m^i$ and covariance matrix $C^i$ are:

$$m^i = E\{x^i\}; \quad C^i = E\{(x^i - m^i)(x^i - m^i)^T\} \tag{1}$$

The eigenvalues of the covariance matrix are
$\lambda_1^i \ge \cdots \ge \lambda_{256}^i$, and the corresponding orthogonal
normalized eigenvectors are $U^i = [u_1, u_2, \ldots, u_{256}]$. The
eigenvectors are the principal components of this numeral class and describe
the numeral structural information. The numeral $x^i$ can be reconstructed
completely in the feature space $U^i$; the feature vector of the numeral is
$y_x^i = [\xi_1, \ldots, \xi_{256}]^T$, namely $y_x^i = U^{iT}(x^i - m^i)$.

Taking 400 digits "0" from the Yonsei University (Korea) digit database and
applying PCA, the feature sub-images recovered from the principal components
are shown in Fig.1. The first two lines are the feature sub-images
corresponding to $u_1 \sim u_8$, and the third line corresponds to
$u_{253} \sim u_{256}$. This figure shows that the first eight feature
sub-images effectively describe the structural information of digit "0",
while the last four carry no meaningful information. As each eigenvalue shows
the contribution of the corresponding feature sub-image in reconstructing the
numeral, the number of principal components may be decreased on the condition
that the remaining ones can still describe the class information of the
numeral.

Fig.1. Digit "0" feature sub-images

As the eigenvalues of the covariance matrix $C^i$ show the contribution of
the corresponding principal components in reconstructing digit $i$, the
number of numeral principal components $n_i$ can be determined from the
eigenvalues:

$$D(n_i) = \frac{\sum_{j=n_i+1}^{256} \lambda_j^i}{\sum_{j=1}^{256} \lambda_j^i} \le a \tag{2}$$

$D(n_i)$ shows the degree of loss of numeral class information; $a$ is a
fixed value and $a \in (0, 1)$.

The numeral principal component feature cannot reflect the height-width ratio
of every numeral class, so "1" is often confused with other numerals. The
handwritten numeral height-width ratio $r$ before slant-correction is greatly
influenced by the character slant angle, so it cannot reflect the numeral
class information, and it is necessary to extract the statistical information
of the numeral height-width ratio after slant-correction. Let the numeral
height and width be $h$ and $w$ respectively; the height-width ratio is
$r = h / w$.

When the concrete form of the class-conditional probability density of $r$ is
unknown, a nonparametric method based on a window function can estimate the
conditional probability density of every class from the training set. The
Parzen window function is:

$$\varphi(u) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}u^2\right) \tag{3}$$

So the class-conditional probability density of the height-width ratio $r$ of
numeral $\omega_i$ is:

$$p(r \mid \omega_i) = \frac{1}{N^i}\sum_{j=1}^{N^i} \varphi(r - r_j^i) = \frac{1}{N^i}\sum_{j=1}^{N^i} \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(r - r_j^i)^2\right] \tag{4}$$

where $r_j^i$ is the height-width ratio of the $j$-th training sample of
class $\omega_i$ and $N^i$ is the number of training samples of that class.
The estimation result shows that the height-width ratios of most numerals
differ from one another except for "6" and "9"; digit "1" in particular is
clearly different from the others.

The topological property of an image can be used to describe the shape of a
flat area; it stays invariant as long as no rupture happens. The Euler value
is a kind of topological measurement. The Euler value is $eu = c - h$, where
$c$ and $h$ denote the number of objects and the number of holes in the image
respectively.
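
The component-count rule of Eq. (2) can be illustrated with a minimal Python sketch. It assumes the eigenvalues of one class covariance matrix are already available; the function name `choose_num_components` and the sample spectrum are hypothetical and are not taken from the paper.

```python
def choose_num_components(eigenvalues, a):
    """Return the smallest n with D(n) = sum(tail) / sum(all) <= a (Eq. (2))."""
    if not 0.0 < a < 1.0:
        raise ValueError("a must lie in (0, 1)")
    # Eq. (2) assumes the eigenvalues are sorted in descending order.
    lam = sorted(eigenvalues, reverse=True)
    total = sum(lam)
    tail = total  # information lost when no component is kept
    for n, value in enumerate(lam, start=1):
        tail -= value  # tail is now lambda_{n+1} + ... + lambda_last
        if tail / total <= a:
            return n
    return len(lam)


# Hypothetical eigenvalue spectrum, for illustration only.
spectrum = [5.0, 3.0, 1.0, 0.5, 0.3, 0.2]
n_keep = choose_num_components(spectrum, a=0.15)
print(n_keep)
```

A smaller $a$ tolerates less information loss and therefore keeps more components.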

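The Parzen-window estimate of Eqs. (3) and (4) might be sketched as follows. The code uses a unit window width because the equations as printed contain no width parameter; the function names and the training ratios are hypothetical values chosen for illustration.

```python
import math


def parzen_window(u):
    """Gaussian window function of Eq. (3)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def class_conditional_density(r, training_ratios):
    """Estimate p(r | class) as in Eq. (4) from one class's training ratios."""
    if not training_ratios:
        raise ValueError("need at least one training sample")
    total = sum(parzen_window(r - rj) for rj in training_ratios)
    return total / len(training_ratios)


# Hypothetical height-width ratios after slant-correction (not from the paper).
ratios_digit_1 = [3.8, 4.1, 4.5, 3.9]
ratios_digit_0 = [1.2, 1.3, 1.1, 1.4]

r = 4.0  # ratio measured on a test sample
p1 = class_conditional_density(r, ratios_digit_1)
p0 = class_conditional_density(r, ratios_digit_0)
print(p1 > p0)
```

With these made-up samples a tall, narrow test character receives a much higher density under class "1" than under class "0", which is the behaviour the text describes.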
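The Euler value $eu = c - h$ can be computed by counting connected regions. The sketch below is one possible implementation under stated assumptions: the image is a binary grid of 0/1 rows, objects are counted with 8-connectivity and holes with 4-connectivity (a common convention that the paper does not specify), and all names are hypothetical.

```python
from collections import deque

FOUR = [(-1, 0), (1, 0), (0, -1), (0, 1)]
EIGHT = FOUR + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def count_components(grid, target, neighbors):
    """Count connected regions of cells equal to `target` by flood fill."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r0 in range(rows):
        for c0 in range(cols):
            if grid[r0][c0] != target or seen[r0][c0]:
                continue
            count += 1
            seen[r0][c0] = True
            queue = deque([(r0, c0)])
            while queue:
                r, c = queue.popleft()
                for dr, dc in neighbors:
                    nr, nc = r + dr, c + dc
                    if (0 <= nr < rows and 0 <= nc < cols
                            and grid[nr][nc] == target and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
    return count


def euler_value(image):
    """Return eu = c - h for a binary image given as rows of 0/1."""
    # Pad with background so the outer background forms exactly one region.
    width = len(image[0]) + 2
    padded = [[0] * width]
    padded += [[0] + list(row) + [0] for row in image]
    padded += [[0] * width]
    objects = count_components(padded, 1, EIGHT)
    holes = count_components(padded, 0, FOUR) - 1  # drop the outer background
    return objects - holes


# Tiny hand-made shapes standing in for "1", "0" and "8".
one = [[1], [1], [1]]
zero = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]
eight = [[1, 1, 1], [1, 0, 1], [1, 1, 1], [1, 0, 1], [1, 1, 1]]
print(euler_value(one), euler_value(zero), euler_value(eight))
```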

For a handwritten numeral, the character is one complete object, so $eu$ is
determined by the number of holes $h$. Usually, the Euler value of "1", "2",
"3", "4", "5" and "7" in handwritten numerals is 1, that of "0", "6" and "9"
is 0, and that of "8" is -1. However, because of differences in image
binarization and in writing habits, wrong strokes or redundant holes often
turn up, so the Euler value of a numeral often deviates from its ideal value.
The Euler value $eu$ can only be a discrete integer. The class-conditional
probability $P(eu \mid \omega_i)$, $i = 0, \ldots, 9$, can be estimated from
the training set:

$$P(eu \mid \omega_i) = \frac{n_{eu}^i}{N^i}, \quad eu = -5, \ldots, 5 \tag{5}$$

$N^i$ is the number of samples of numeral $\omega_i$ in the training set, and
$n_{eu}^i$ is the number of samples among these $N^i$ whose Euler value is
$eu$, so that $\sum_{eu=-5}^{5} P(eu \mid \omega_i) = 1$. Statistical analysis
indicates that the numeral Euler value can effectively differentiate "3", "8"
and "9".

4 Design of Classifier

The eigenvector matrix $U^i$ of numeral $\omega_i$ calculated by PCA is
orthonormal, so any numeral $x$ can be completely reconstructed in this
feature space. But when the number of principal components $m < 256$, namely
when each numeral class $\omega_i$ has a fixed number of reconstruction
vectors, the reconstructive ability of these vectors is strongest for the
numerals of their own class. So when the number of reconstruction vectors is
fixed in the different numeral spaces, the bias between a numeral $x$ and its
reconstructed result $\hat{x}$ should be smallest in its own feature space.

The reconstructive bias with a fixed number of principal components in the
feature space $U^i$ is

$$e^i(x) = \left\| x - \sum_{j=1}^{m} \left[ u_j^{iT} (x - m^i) \right] u_j^i - m^i \right\|_2 \tag{6}$$

The classifying rule is:

$$e^i(x) = \min_{j=0,\ldots,9} \{ e^j(x) \} \Rightarrow x \in \omega_i \tag{7}$$

According to the rule above, different numbers of principal components were
used to classify 3000 samples from the test set; the result is shown in
Tab.1. From the table we can see the following. The recognition rate is
84.57% when only the first three principal components are used for
reconstruction, which shows that these three components capture the major
class information of the numerals. The recognition rate increases only
slightly when the dimension grows from six to sixteen, and as the dimension
keeps increasing the rate begins to fall, which shows that most of the class
information of the numerals is concentrated in the first dozen or so
principal components. As the structure of "0" is simple and its information
is concentrated in the first several components, its recognition rate falls
when the number of components increases.

When the principal component feature of the handwritten numeral is fed to a
neural network classifier, trained and tested on the numeral sample database,
the maximum recognition rate reaches 85.6%. Although different methods were
tried to combine the results of the separately trained neural network
classifiers, the recognition rate increased only a little. Two patterns that
are hard to separate in one feature space may be easy to separate in another,
because different features describe different aspects of a pattern. The
class-conditional probability density of the height-width ratio and the
class-conditional probability of the Euler value can be easily estimated, and
the numeral posterior probability can then be calculated according to the
Bayes formula.

The posterior probability given the height-width ratio $r$ of a test sample
is:

$$P(\omega_i \mid r) = \frac{p(r \mid \omega_i) P(\omega_i)}{\sum_{j=0}^{9} p(r \mid \omega_j) P(\omega_j)}, \quad p_r = [P(\omega_0 \mid r), \ldots, P(\omega_9 \mid r)]^T \tag{8}$$

The posterior probability given the Euler value $eu$ of a test sample is:

$$P(\omega_i \mid eu) = \frac{P(eu \mid \omega_i) P(\omega_i)}{\sum_{j=0}^{9} P(eu \mid \omega_j) P(\omega_j)}, \quad p_{eu} = [P(\omega_0 \mid eu), \ldots, P(\omega_9 \mid eu)]^T \tag{9}$$

The output of the combining classifiers is:

$$C = \mathrm{Combine}(p, p_r, p_{eu}) = \begin{bmatrix} P(\omega_0 \mid x)\, P(\omega_0 \mid r)\, P(\omega_0 \mid eu) \\ \vdots \\ P(\omega_9 \mid x)\, P(\omega_9 \mid r)\, P(\omega_9 \mid eu) \end{bmatrix} \tag{10}$$

The classifying rule of the combining classifiers is:

$$c_j = \max_{i=0,\ldots,9} \{ c_i \} \Rightarrow x \in \omega_j \tag{11}$$

Using the combining classifiers to classify 3000 samples from the test set,
the result is shown in Tab.2. In the table, the recognition rate has risen
above 90%, but the recognition rate of "7" is not as good as expected. In
order to satisfy the demand for high confidence in practical applications,
the recognition system can refuse to classify when the class of a numeral is
ambiguous:

$$P(\omega_i \mid x) = \max_{j=0,\ldots,9} P(\omega_j \mid x), \quad P(\omega_i \mid x) \ge t \Rightarrow x \in \omega_i$$

$t \in [0.1, 1]$ is the refusing threshold. When the maximum posterior
probability calculated by the classifiers satisfies $P(\omega_i \mid x) < t$,
the refusing decision is adopted. With this rule, the maximum recognition
rate of the system reached 97.64% in testing.
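
The Bayes posteriors of Eqs. (8) and (9), the combination of Eqs. (10) and (11), and the refusing rule can be sketched together in Python. This is a minimal illustration under assumptions: Eq. (10) is read as a per-class product of the three posteriors, the refusing rule is applied to a single posterior vector, only three classes are used to keep the example short, and every function name and number is hypothetical.

```python
def posterior(likelihoods, priors):
    """Bayes formula of Eqs. (8) and (9): normalize likelihood * prior."""
    joint = [lik * pri for lik, pri in zip(likelihoods, priors)]
    evidence = sum(joint)
    if evidence == 0.0:
        raise ValueError("all classes have zero likelihood")
    return [value / evidence for value in joint]


def combine(p_x, p_r, p_eu):
    """Per-class product of the three posteriors (assumed reading of Eq. (10))."""
    return [a * b * c for a, b, c in zip(p_x, p_r, p_eu)]


def classify(scores):
    """Eq. (11): index of the class with the largest combined score."""
    return max(range(len(scores)), key=scores.__getitem__)


def classify_with_reject(posteriors, t):
    """Return the arg-max class, or None when its posterior is below t."""
    best = classify(posteriors)
    return best if posteriors[best] >= t else None


# Hypothetical three-class example with equal priors.
priors = [1.0 / 3.0] * 3
p_x = [0.2, 0.5, 0.3]                           # stand-in for network outputs
p_r = posterior([0.02, 0.30, 0.28], priors)     # from height-width ratio
p_eu = posterior([0.10, 0.10, 0.80], priors)    # from Euler value

scores = combine(p_x, p_r, p_eu)
print(classify(scores))
print(classify_with_reject(p_x, t=0.6))
```

In this made-up case the network alone prefers class 1, while the Euler-value posterior shifts the combined decision to class 2; with a threshold of 0.6 the network's own decision would be refused.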

Tab.1. Recognition result of the minimum reconstructive bias classifier based on principal components
[Table body not recoverable from the source. Columns: the amount of principal components; the amount of samples correctly classified for digit 0 to digit 9; the total amount of samples correctly classified; recognition rate.]
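
The minimum reconstructive bias classifier of Eq. (6), whose results Tab.1 reports, can be sketched in plain Python. The sketch assumes each class model is a mean vector plus a list of orthonormal principal components and uses the Euclidean norm; the toy three-dimensional models and all names are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def reconstruction_error(x, mean, components):
    """e^i(x) of Eq. (6) for one class with orthonormal components."""
    centered = [a - b for a, b in zip(x, mean)]
    residual = list(centered)
    for u in components:
        coeff = dot(u, centered)  # projection coefficient u_j^T (x - m)
        residual = [res - coeff * uj for res, uj in zip(residual, u)]
    return math.sqrt(dot(residual, residual))


def classify_min_error(x, class_models):
    """Assign x to the class whose subspace reconstructs it with least bias."""
    errors = [reconstruction_error(x, mean, comps)
              for mean, comps in class_models]
    label = min(range(len(errors)), key=errors.__getitem__)
    return label, errors


# Two toy classes in 3-D, each keeping a single principal component.
models = [
    ([0.0, 0.0, 0.0], [[1.0, 0.0, 0.0]]),  # class 0 varies along the first axis
    ([0.0, 0.0, 0.0], [[0.0, 1.0, 0.0]]),  # class 1 varies along the second axis
]
label, errors = classify_min_error([2.0, 0.1, 0.0], models)
print(label, errors)
```

The sample lies almost entirely along the first axis, so the class-0 subspace reconstructs it with a much smaller error than the class-1 subspace.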

            

Tab.2. Recognition result of the combining classifiers

The sample  The recognition result
class           0    1    2    3    4    5    6    7    8    9
0             294    0    0    2    1    0    2    1    0    0
1               0  273   15    0    0    7    3    2    0    0
2               1    0  281    7    5    5    0    1    0    0
3               1    0    2  280    1    8    0    0    8    0
4               3    3   11    5  282    4    3    3    3    7
5               4    3    7   16   10  262    6    4    4    8
6               3    1    3    6    3   18  268    1    3    2
7               1    0   51    7    1    1    0  235    0    4
8               5    0    0    7    4    6    1    0  274    3
9               1    1    8    6    2    3    0    3    3  273

The total amount of tested samples: 3000
The total amount correctly classified: 2722    Recognition rate: 90.73%
The total amount wrongly classified: 278       Error rate: 9.27%
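
The totals under Tab.2 can be re-derived from the table's diagonal. The short check below assumes the diagonal entries are the correctly recognized samples of classes 0 to 9 and uses only numbers printed in the table; the variable names are illustrative.

```python
# Diagonal of Tab.2: correctly recognized samples for classes 0..9.
correct_per_class = [294, 273, 281, 280, 282, 262, 268, 235, 274, 273]
total_tested = 3000  # total reported under Tab.2

total_correct = sum(correct_per_class)
total_wrong = total_tested - total_correct
recognition_rate = round(100.0 * total_correct / total_tested, 2)
error_rate = round(100.0 * total_wrong / total_tested, 2)
print(total_correct, total_wrong, recognition_rate, error_rate)
```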


5 Conclusion

The PCA feature of the handwritten numeral describes the statistical
information of the numeral structural feature, and the height-width ratio and
Euler value are extracted to cover its weakness in describing the differences
among some numerals. Applying these three types of features to handwritten
numeral recognition gave a recognition rate of 90.73% on 3000 test samples.
Applying features of different styles in a recognition system is a promising
method.

Author Biography

Wan Junli: born in 1957. Professor, working in China Three Gorges University.
Main research interests: weak signal detection and pattern recognition.

References

[1] Rejean Plamondon. On-Line and Off-Line Handwriting Recognition: A
Comprehensive Survey. IEEE Trans. on Pattern Analysis and Machine
Intelligence, 2000, 22(1): 63-84.
[2] Rui Ting, Shen Chunlin, Ding Jian, Zhang Jinlin. Handwritten Numeral
Character Recognition Based on Principal Component Analysis. Minityped
Computer System, 2005, 26(2): 289-292.
[3] Cheng-Lin Liu, Kazuki Nakashima, Hiroshi Sako, Hiromichi Fujisawa.
Handwritten digit recognition: investigation of normalization and feature
extraction techniques. Pattern Recognition, 2004, 37: 265-279.
[4] Wang Youwei, Liu Jie. A Method of Tilted Inclination in Handwritten
Numeral Recognition. Engineering of Computer, 2004, 30(11): 128-137.
[5] Zhang Guohua. Research on Handwritten Numeral Recognition Based on
Principal Component Analysis and Multi-classifiers Combination [A
Dissertation for the Degree of Master]. Yichang: Three Gorges University,
2006.
[6] Mark Girolami, Chao He. Probability Density Estimation from Optimally
Condensed Data Samples. IEEE Trans. on Pattern Analysis and Machine
Intelligence, 2003, 25(10): 1253-1264.
