
2009 10th International Conference on Document Analysis and Recognition

Devanagari and Bangla Text Extraction from Natural Scene Images


U. Bhattacharya, S. K. Parui and S. Mondal
Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata 108, India
{ujjwal, swapan, srikanta_t}@isical.ac.in

Abstract
With the increasing popularity of digital cameras attached to various handheld devices, many new computational challenges have gained significance. One such problem is the extraction of text from natural scene images captured by such devices. The extracted text can be sent to an OCR or a text-to-speech engine for recognition. In this article, we propose a novel and effective scheme, based on the analysis of connected components, for extraction of Devanagari and Bangla texts from camera-captured scene images. A common and unique feature of these two scripts is the presence of headlines, and the proposed scheme uses mathematical morphology operations for their extraction. Additionally, we consider a few criteria for robust filtering of text components from such scene images. Moreover, we studied the problem of binarization of such scene images and observed that there are situations in which repeated binarization by a well-known global thresholding approach is effective. We tested our algorithm on a repository of 100 scene images containing texts of Devanagari and/or Bangla.

978-0-7695-3725-2/09 $25.00 © 2009 IEEE. DOI 10.1109/ICDAR.2009.178

1. Introduction

Digital cameras have now become very popular and are often attached to various handheld devices such as mobile phones and PDAs. Manufacturers of these devices are nowadays looking to embed various useful technologies into them. Prospective technologies include recognition of texts in scene images, text-to-speech conversion, etc. Extraction and recognition of texts in images of natural scenes are useful to the blind and to foreigners facing a language barrier. Furthermore, the ability to automatically detect text in scene images has potential applications in image retrieval, robotics and intelligent transport systems. However, developing a robust scheme for extraction and recognition of texts from camera-captured scenes is a great challenge due to several factors, which include variations of style, color, spacing, distribution and alignment of texts, background complexity, influence of luminance, and so on.

A survey of existing methods for detection, localization and extraction of texts embedded in images of natural scenes can be found in [1]. Two broad categories of available methods are connected component (CC) based and texture based algorithms. The first category segments an image into a set of CCs and then classifies each CC as either text or non-text. CC-based algorithms are relatively simple, but they often fail to be robust. On the other hand, texture-based methods assume that texts in images have textural properties different from those of the background or other non-text regions. Although the algorithms of the latter category are more robust, they usually have higher computational complexities. Additionally, a few authors have studied various combinations of the above two categories of methods.

Among early works, Zhong et al. [2] located text in images of compact discs, book covers and traffic scenes in two steps. In the first step, approximate locations of text lines were obtained, and then text components in those lines were extracted using color segmentation. Wu et al. [3] proposed a texture segmentation method to generate candidate text regions. A set of feature components is computed for each pixel, and these are clustered using the K-means algorithm. Jung et al. [4] employed a multi-layer perceptron classifier to discriminate between text and non-text pixels. A sliding window scans the whole image and serves as the input to a neural network. A probability map is constructed in which high-probability areas are regarded as candidate text regions. In [5], Li et al. computed features from the wavelet decomposition of a grayscale image and used a neural network classifier to label small windows as text or non-text. Gllavata et al. [6] considered wavelet transform based texture analysis for text detection. They used the K-means algorithm to cluster text and non-text regions. Saoi et al. [7] used a similar but improved method for detection of text in natural scene images. In this attempt, the wavelet transform is applied separately to the R, G and B channels of the input color image. Ezaki, Bulacu and Schomaker [8] studied morphological operations for detection of connected text components in images. They used a disk filter, obtaining the difference between the closing and the opening of the image. The filtered images are binarized to extract connected components. In a recent work, Liu et al. [9] used a Gaussian mixture distribution to model the occurrence of three neighbouring characters and proposed a scheme under a Bayesian framework for discriminating text and non-text components. Pan et al. [10] used a sparse representation based method for the same purpose. Ye et al. [11] proposed a coarse-to-fine strategy using multiscale wavelet features to locate text lines in color images. The text segmentation method described in [12] uses a combination of a CC-based stage and a region filtering stage based on a texture measure.

Devanagari and Bangla are the two most popular Indian scripts, used by more than 500 and 200 million people respectively in the Indian subcontinent. A unique and common characteristic of these two scripts is the existence of headlines, as shown in Fig. 1. The focus of the present work is to exploit this fact for extraction of Devanagari and Bangla texts from images of natural scenes. The only assumption we make is that the characters are sufficiently large and/or thick so that a linear structuring element of a certain fixed length can capture their headlines. To the best of our knowledge, no existing work deals with this particular problem.

The rest of this article is organized as follows. Section 2 describes the preprocessing operations. The proposed method is described in Section 3. Experimental results are provided in Section 4. Section 5 concludes the paper.

2. Preprocessing
The size of an input image varies with the resolution of the digital camera, which is usually 1 MP or more. Initially, we downsample the input image by an integral factor so that its size is reduced to the nearest of 0.25 MP. Next, it is converted to an 8-bit grayscale image using the formula Y = 0.299R + 0.587G + 0.114B. In fact, there is no absolute reference for the weights of R, G and B. However, the above set of weights was standardized by the NTSC (National Television System Committee) and its use is common in computer imaging. A global binarization method such as the well-known Otsu's technique is usually not suitable for camera-captured images, since the gray-value histogram of such an image is not bimodal. Binarization of such an image using a single threshold value often loses textual information into the background. The texts in the images of Figs. 2(a) and 2(b) are lost during binarization by Otsu's method.
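The grayscale conversion and global thresholding described above can be sketched as follows. This is a minimal NumPy sketch under our own naming; the synthetic test image at the end is ours, not from the paper.

```python
import numpy as np

def to_grayscale(rgb):
    """NTSC weighted sum; rgb is an H x W x 3 uint8 array."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return (0.299 * r + 0.587 * g + 0.114 * b).astype(np.uint8)

def otsu_threshold(gray):
    """Return the gray level that maximizes the between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                  # probability of class 0 up to level t
    mu = np.cumsum(prob * np.arange(256))    # cumulative mean up to level t
    with np.errstate(divide='ignore', invalid='ignore'):
        sigma_b = (mu[-1] * omega - mu) ** 2 / (omega * (1.0 - omega))
    return int(np.argmax(np.nan_to_num(sigma_b)))

# Synthetic bimodal image: left half at gray 50, right half at 200
img = np.full((10, 10), 200, dtype=np.uint8)
img[:, :5] = 50
t = otsu_threshold(img)    # any t in [50, 199] separates the two modes
binary = img > t
```

On a genuinely bimodal histogram such as this one, a single global threshold is sufficient; the difficulty discussed above arises precisely because scene images rarely produce such histograms.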

Figure 1. (a) A piece of text in Devanagari, (b) a piece of text in Bangla.

The present study is based on a set of 100 outdoor images of signboards, banners, hoardings and nameplates collected using two different cameras. Connected components (both black and white) are extracted from the binary image. Then, we use the morphological opening operation along with a set of criteria to extract the headlines of Devanagari or Bangla texts. Next, we use several geometrical properties of the characters of these two scripts to locate the whole text parts in relation to the detected headlines.

Figure 2. (a) and (b) Two scene images; (c) and (d) binarization of (a) and (b) by Otsu's method.

On the other hand, local binarization methods are generally window-based, and the choice of window size severely affects the result, producing broken characters if the characters are thicker than the window. We implemented an adaptive thresholding technique which uses the simple average gray value in a 27×27 window around a pixel as the threshold for that pixel. In Fig. 3, we show the binarization results of the images of Fig. 2 by this adaptive method. However, the example in Fig. 3(b) has text components connected with the background, and similar situations occurred frequently with the scene images used in our experiments. The later stages of the proposed method cannot recover from this error.
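A windowed-mean threshold of this kind can be computed efficiently with an integral image. The following is our own sketch of the idea, not the authors' implementation; the dark-dot test image is made up for illustration.

```python
import numpy as np

def adaptive_mean_threshold(gray, win=27):
    """Mark a pixel as foreground if it is darker than the mean gray
    value of the win x win window centered on it."""
    pad = win // 2
    padded = np.pad(gray.astype(np.float64), pad, mode='edge')
    # Integral image with a leading row/column of zeros
    ii = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
    ii[1:, 1:] = padded.cumsum(0).cumsum(1)
    h, w = gray.shape
    # Window sums via four corner lookups per pixel
    s = (ii[win:win + h, win:win + w] - ii[:h, win:win + w]
         - ii[win:win + h, :w] + ii[:h, :w])
    mean = s / (win * win)
    return gray < mean    # dark text on a light background

# A single dark pixel on a uniform bright background
img = np.full((31, 31), 200, dtype=np.uint8)
img[15, 15] = 0
out = adaptive_mean_threshold(img)
```

The integral image makes the cost independent of the window size, which matters for a window as large as 27×27.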

Figure 3. (a) and (b) Binarization of the images in Figs. 2(a) and 2(b) by the adaptive method.

On the other hand, we observed that applying Otsu's method a second time, separately on the sets of foreground and background pixels of the binarized image, often recovers the lost text efficiently. This second application of Otsu's method converts several pixels from foreground to background and vice versa. The final results of applying Otsu's method twice on the input images of Fig. 2 are shown in Fig. 4.

Figure 4. Results of binarization by applying Otsu's method twice; (a) the binarized image of the sample in Fig. 2(a), (b) the binarized image of the sample in Fig. 2(b).
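The two-pass binarization can be sketched as below. The paper does not spell out how pixels are reassigned after re-thresholding each class; here we assume that, within each class, the darker sub-class is kept (or made) foreground. That reassignment rule is our guess, not necessarily the authors' exact procedure.

```python
import numpy as np

def otsu_threshold(vals):
    """Otsu threshold over an array of 8-bit gray values."""
    hist = np.bincount(vals.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)
    mu = np.cumsum(prob * np.arange(256))
    with np.errstate(divide='ignore', invalid='ignore'):
        sigma_b = (mu[-1] * omega - mu) ** 2 / (omega * (1.0 - omega))
    return int(np.argmax(np.nan_to_num(sigma_b)))

def double_otsu(gray):
    """First pass on the whole image; second pass within each class."""
    t1 = otsu_threshold(gray)
    fg = gray <= t1                 # dark pixels = initial foreground
    out = fg.copy()
    for mask in (fg, ~fg):
        vals = gray[mask]
        if vals.size == 0 or vals.min() == vals.max():
            continue                # nothing left to split in this class
        t2 = otsu_threshold(vals)
        # Assumed rule: the darker sub-class of each class becomes foreground
        out[mask] = gray[mask] <= t2
    return out

# Trimodal test image: dark text (10), mid-gray clutter (100), bright background (240)
img = np.full((4, 4), 240, dtype=np.uint8)
img[0, :] = 10
img[1, :] = 100
out = double_otsu(img)
```

On this synthetic image the second pass peels the mid-gray clutter off the foreground; on the images of Fig. 2 the useful effect runs the other way, pulling lost text back out of the background class.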

3. Proposed approach for text extraction


Extraction of Devanagari and/or Bangla texts from binarized images is primarily based on the unique property of these two scripts that they have headlines, as in Fig. 1. The basic steps of our approach, summarized below, are executed separately on the images resulting from the first and the second binarization.

3.1. Algorithm
Step 1: Obtain the connected components (C) from the binary image (B) corresponding to the gray image (A). These include both white and black components.
Step 2: Compute all horizontal or nearly horizontal line segments by applying the morphological opening operation (Section 3.2) on each C. See Fig. 5(a).
Step 3: Obtain connected sets of the above line segments. If multiple connected sets are obtained from the same C, consider only the largest one and call it the candidate headline component HC.
Step 4: Let E denote a component C that produces a candidate headline component HC. Replace E by subtracting HC from it. Thus, E may now become disconnected, consisting of several connected components.
Step 5: For each E, compute H1 and H2, the heights of the parts of E that lie above and below HC respectively.
Step 6: Obtain the height (h) of each connected component F of E that lies below HC. For each E, compute p = the standard deviation of h divided by the mean of h.
Step 7: If both H1 / H2 and p are less than two suitably selected threshold values, call the corresponding HC the true headline component, HT. Note that characters of Devanagari and Bangla always have a part below the headline, and a possible part above the headline is always smaller than the part below it.
Step 8: Select all the components C corresponding to each true headline component HT.
Step 9: Revisit all the connected components that have not been selected above. For each such component, examine whether any other component in its immediate neighborhood has already been selected. If so, compare the gray values of the two components in image A, and if these values are very close, include the former component in the set of selected components.

As an example, consider the binarized image of Fig. 4(a). All the line segments produced by the morphological operations on each component are shown in Fig. 5(a). Points on horizontal line segments obtained from white components are shown in gray, while those obtained from black components are shown in black. The candidate headlines obtained at the end of Step 3 are shown in Fig. 5(b). The result of subtracting the candidate headline components from their respective parent components is shown in Fig. 5(c). The true headline components obtained at the end of Step 7 are shown in Fig. 5(d). The text components selected by Step 8 are shown in Fig. 5(e). Finally, a few other possible text components are selected by the last step, and the final set of selected components is shown in Fig. 5(f). In this particular example, all the text components have been selected; however, one non-text component (at the bottom of the image) has also been selected.
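The headline verification of Steps 5-7 amounts to a small predicate over component heights. The sketch below uses threshold values that are illustrative guesses on our part; the paper does not report the thresholds it actually used.

```python
import numpy as np

def is_true_headline(heights_below, h_above, h_below,
                     ratio_max=0.75, p_max=0.5):
    """Accept a candidate headline if (i) the part of the component above
    it is small relative to the part below (H1/H2, Step 7), and (ii) the
    heights of the pieces hanging below it are uniform (p = std/mean,
    Step 6). Both thresholds are illustrative, not the paper's values."""
    hs = np.asarray(heights_below, dtype=float)
    if h_below == 0 or hs.size == 0:
        return False
    ratio = h_above / h_below          # H1 / H2
    p = hs.std() / hs.mean()           # coefficient of variation of h
    return bool(ratio < ratio_max and p < p_max)

# Uniform character bodies below a short upper part: accepted
print(is_true_headline([20, 21, 19, 20], h_above=5, h_below=21))   # True
```

A large upper part or wildly varying heights below the candidate line, both typical of non-text strokes that happen to contain a long horizontal run, cause rejection.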


Figure 5. Results of different stages of the algorithm based on the image of Fig. 2(a); (a) all line segments obtained by the morphological operation, (b) the set of candidate headlines, (c) all the components minus their respective candidate headlines, (d) the true headlines, (e) the components selected corresponding to true headlines, (f) the final set of selected components.

3.2. Morphological operation

We apply mathematical morphology tools, namely erosion followed by dilation, on each connected component to extract possible horizontal line segments. For illustration, consider an object A and a structuring element B as shown in Figs. 6(a) and 6(b) respectively. The erosion of A by B, denoted A-B, is defined as the set of all pixels P in A such that, when B is placed with its center at P, B is entirely contained in A. The eroded object A-B is shown in Fig. 6(c). Dilation is, in a sense, the dual of erosion. For each pixel P in A, consider the placement B(P) of the structuring element B with its center at P. The dilation of A by B, denoted A+B, is defined as the union of the placements B(P) over all P in A. The opening of A by B is (A-B)+B, shown in Fig. 6(d). Opening with a linear structuring element can effectively identify the horizontal line segments present in a connected component. However, a suitable choice of the length of this structuring element is crucial for the later stages, and we empirically set its length to 21 for the present problem.

Figure 6. (a) An object (A), (b) a structuring element (B), (c) the eroded object (C = A-B), (d) the object after opening (D = (A-B)+B).

Figure 7. A few images on which our algorithm performed perfectly, and the respective outputs.
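The opening used in Step 2 can be sketched for binary images as erosion followed by dilation with a 1×21 horizontal structuring element. This is our own NumPy formulation, not the authors' code; the test image is made up.

```python
import numpy as np

def _windows(img, length):
    """Stack of horizontally shifted copies covering a 1 x length window."""
    pad = length // 2
    padded = np.pad(img, ((0, 0), (pad, pad)), constant_values=False)
    return np.stack([padded[:, i:i + img.shape[1]] for i in range(length)])

def erode_h(img, length):
    """A pixel survives erosion only if its whole window is foreground."""
    return _windows(img, length).all(axis=0)

def dilate_h(img, length):
    """A pixel is set by dilation if any pixel in its window is foreground."""
    return _windows(img, length).any(axis=0)

def open_h(img, length=21):
    """Opening = erosion then dilation; keeps only horizontal runs
    of at least `length` pixels, removing shorter strokes."""
    return dilate_h(erode_h(img, length), length)

# A 30-pixel run (headline-like) survives the opening; a 5-pixel run does not
img = np.zeros((3, 60), dtype=bool)
img[1, 5:35] = True      # long horizontal segment
img[1, 40:45] = True     # short segment
out = open_h(img)
```

This is exactly why the structuring-element length matters: any headline shorter than 21 pixels after downsampling is erased along with the noise.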

4. Experimental results
We obtained simulation results on 100 test images acquired by (i) a Kodak DX7590 (5.0 MP) still camera and (ii) a SONY DCR-SR85E handycam used in still mode (1.0 MP). The resolutions of the images captured by these two cameras are 2576×1932 and 644×483 pixels respectively. After downsampling, their sizes are reduced to 644×483 and 576×432 pixels respectively. These are images of highways, institutions, railway stations, festival grounds, etc. They focus on the names of buildings, shops, railway stations and financial institutions, or on advertisement hoardings. They contain Devanagari and Bangla texts of various font styles, sizes and directions. A few of the images on which the algorithm perfectly extracts all the Bangla and Devanagari text components are shown in Fig. 7. There are 58 such images for which all relevant text components could be extracted. On the other hand, two sample images on which the performance of our algorithm is extremely poor are shown in Fig. 8. Similarly poor performance occurred with 6 of our sample images. On the remaining 36 images, the algorithm either partially extracted the relevant text components or extracted text along with a few non-text components. In summary, the precision and recall values of our algorithm on the present set of 100 images are 68.8% and 71.2% respectively.
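The reported component-level precision and recall follow the usual definitions; the counts in the example below are made up for illustration and are not the paper's figures.

```python
def precision_recall(tp, fp, fn):
    """tp: text components correctly extracted; fp: non-text components
    wrongly extracted; fn: text components missed."""
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical counts for illustration only
p, r = precision_recall(tp=80, fp=20, fn=25)
```

Precision penalizes non-text components that slip through the filtering of Steps 5-9, while recall penalizes headlines missed by the opening operation or the subsequent criteria.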

Figure 8. Two sample images on which the performance of our algorithm is very poor.

5. Conclusions

The proposed algorithm works well even on slanted or curved text components of Devanagari and Bangla. One such situation is shown in Fig. 9. However, the proposed algorithm fails whenever such curved or slanted text is not sufficiently large.

Figure 9. Two images consisting of curved or slanted texts; the extracted components are shown to the right of each source image.

In the future, we shall study the use of machine learning tools to improve the performance of the proposed algorithm.

References

[1] J. Liang, D. Doermann and H. Li, "Camera-based analysis of text and documents: a survey", Int. Journal on Document Analysis and Recognition (IJDAR), vol. 7, pp. 84-104, 2005.
[2] Y. Zhong, K. Karu and A. K. Jain, "Locating text in complex color images", Proc. 3rd Int. Conf. on Document Analysis and Recognition (ICDAR), vol. 1, pp. 146-149, 1995.
[3] V. Wu, R. Manmatha and E. M. Riseman, "TextFinder: an automatic system to detect and recognize text in images", IEEE Trans. on PAMI, vol. 21, pp. 1224-1228, 1999.
[4] K. Jung, K. I. Kim, T. Kurata, M. Kourogi and J. H. Han, "Text scanner with text detection technology on image sequences", Proc. 16th Int. Conf. on Pattern Recognition (ICPR), vol. 3, pp. 473-476, 2002.
[5] H. Li, D. Doermann and O. Kia, "Automatic text detection and tracking in digital video", IEEE Trans. on Image Processing, vol. 9, no. 1, pp. 147-167, 2000.
[6] J. Gllavata, R. Ewerth and B. Freisleben, "Text detection in images based on unsupervised classification of high-frequency wavelet coefficients", Proc. 17th Int. Conf. on Pattern Recognition (ICPR), vol. 1, pp. 425-428, 2004.
[7] T. Saoi, H. Goto and H. Kobayashi, "Text detection in color scene images based on unsupervised clustering of multichannel wavelet features", Proc. 8th Int. Conf. on Document Analysis and Recognition (ICDAR), pp. 690-694, 2005.
[8] N. Ezaki, M. Bulacu and L. Schomaker, "Text detection from natural scene images: towards a system for visually impaired persons", Proc. 17th Int. Conf. on Pattern Recognition (ICPR), vol. II, pp. 683-686, 2004.
[9] X. Liu, H. Fu and Y. Jia, "Gaussian mixture modeling and learning of neighboring characters for multilingual text extraction in images", Pattern Recognition, vol. 41, pp. 484-493, 2008.
[10] W. Pan, T. D. Bui and C. Y. Suen, "Text detection from scene images using sparse representation", Proc. 19th Int. Conf. on Pattern Recognition (ICPR), 2008.
[11] Q. Ye, Q. Huang, W. Gao and D. Zhao, "Fast and robust text detection in images and video frames", Image and Vision Computing, vol. 23, pp. 565-576, 2005.
[12] C. Merino and M. Mirmehdi, "A framework towards realtime detection and tracking of text", Proc. 2nd Int. Workshop on Camera-Based Document Analysis and Recognition (CBDAR), pp. 10-17, 2007.