
116 (IJCNS) International Journal of Computer and Network Security,

Vol. 1, No. 2, November 2009

Lips Detection using Closed Boundary Watershed and Improved H∞ Lips Tracking System
Siew Wen Chin1, Kah Phooi Seng2, Li-Minn Ang3 and King Hann Lim4
1 The University of Nottingham, School of Electrical and Electronic Engineering,
Jalan Broga, Semenyih, Selangor 43500, Malaysia.
{keyx8csw1, Jasmine.Seng2, Kenneth.Ang3, keyx7khl4}@nottingham.edu.my

Abstract: The audio-visual speech authentication (AVSA) system, which offers a user-friendly platform, is growing rapidly for ownership verification and network security. Front-end lips detection and tracking is the key to making the overall AVSA a success. In this paper, lips detection using a closed-boundary watershed approach and an improved H∞ lips tracking system are presented. The input image is first segmented into regions using the watershed algorithm. The segmented regions are then sent for lips detection, which is based on lips colour clustering with a cubic spline interpolant boundary. An improved H∞ tracking system based on the Lyapunov stability theory (LST) is then designed to predict the lips location in the succeeding image. The proposed system has the advantage of dispensing with preliminary face localization before lips detection. Moreover, the image processing time is further reduced by processing only the image within an adjustable small window around the predicted point, instead of screening the full-size image throughout the image sequence.

Keywords: Audio-visual speech authentication, lips detection and tracking, watershed, H∞ filtering, Lyapunov stability theory.

1. Introduction

With the rapid growth of computer and communication network technology, the security of multimedia data transmitted and retrieved over open networks has drawn extensive attention [1, 2]. Multimodal biometric authentication approaches [3], especially audio-visual speech authentication [4], which offers an inconspicuous and user-friendly platform, are booming as a solution for ownership verification.

In audio-visual speech processing, the front-end lips detection and tracking system is crucial to the success of the overall system. Numerous lips detection approaches have been published [5-7]. Jamal et al. [5] proposed lips detection in the normalized RGB colour scheme, where the normalized image is first segmented into skin and non-skin regions using histogram thresholding. The lips region is then detected from the skin pixels. Furthermore, lips region segmentation using multivariate statistical parameter estimators, combined with connected component analysis and some post-processing, is presented by B. Goswami et al. [6]. The face region is first segmented and the lower half of the face is extracted to classify skin and non-skin regions. The lip contour is obtained by further applying connected component analysis and post-processing to the non-skin region. Besides, Lewis et al. [7] proposed pixel-based red exclusion lips feature extraction, in which the logarithm of the green-to-blue colour ratio is used as the threshold to extract the lips area and its features.

From the aforementioned methodologies, it is noticed that most of the proposed systems require face detection as a prerequisite procedure [5, 6]. Furthermore, some of the appearance-based lips segmentation approaches [7] do not offer closed-boundary segmentation, which might cause the loss of some crucial information for further visual speech analysis. In this paper, an automatic lips detection system based on the watershed approach, without preliminary face localization, is proposed. The lips region, which possesses the closed-boundary characteristic, is segmented directly from the input image without the face detection process.

To enhance the efficiency of the overall lips detection system, a lips tracking system is adopted. The coordinates of the successfully detected lips region are passed to the improved H∞ tracking system to predict the lips location in the succeeding incoming image. The improved H∞ filtering based on the LST is designed to give a better tracking ability than the conventional Kalman and H∞ filtering. The improved H∞ possesses the LST characteristic that the tracking error asymptotically converges to zero as time approaches infinity, since the LST ensures that the tracking system is always stable and strongly robust with respect to bounded input disturbances [8].

After the predicted location is obtained from the improved H∞ tracking system, the subsequent lips detection process is focused only on a small window of the image set around the predicted location. The area of the small window is adjustable to suit the circumstance where the subject moves towards the detecting device and the lips size gradually increases. With a fixed window, the growing lips region would extend beyond the window as the subject approaches, causing the loss of information for further visual analysis. If the prediction from the tracking system is inaccurate and the lips region cannot be retrieved, full-size image processing is restarted. The overview of the proposed watershed lips detection and the modified H∞ tracking system is illustrated in Figure 1. This paper is organized as follows: Section 2 introduces the proposed watershed lips segmentation, while Section 3 discusses the lips detection and verification process. Subsequently, the proposed improved H∞ based on LST and an adjustable window for lips tracking are presented in Section 4. Simulation results and analysis are shown in Section 5, followed by the conclusion in Section 6.
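The detect, predict and re-window loop outlined in this introduction can be sketched roughly as below. This is only an illustration under stated assumptions: `track`, `window_around`, the `detect` and `predict` callables, the `(x0, y0, x1, y1)` window format and the fixed half-width are all hypothetical names and values that do not come from the paper, and the window resizing and the additional checks of Figure 1 are left out.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[int, int]
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), hypothetical window format


def window_around(p: Point, half: int, size: Tuple[int, int]) -> Box:
    """Square window of half-width `half` around p, clamped to the frame size."""
    (x, y), (w, h) = p, size
    return (max(0, x - half), max(0, y - half), min(w, x + half), min(h, y + half))


def track(
    frames: Sequence[object],
    size: Tuple[int, int],
    detect: Callable[[object, Optional[Box]], Optional[Point]],
    predict: Callable[[Point], Point],
    half: int = 32,
) -> List[Tuple[Optional[Point], str]]:
    """Return (lips centre, mode) per frame; mode is 'full' or 'window'."""
    out = []
    box: Optional[Box] = None  # None means: process the full-size image
    for frame in frames:
        mode = "full" if box is None else "window"
        centre = detect(frame, box)
        if centre is None and box is not None:
            # The prediction missed the lips: restart full-size processing.
            centre, mode = detect(frame, None), "full"
        out.append((centre, mode))
        # Window for the next frame, centred on the predicted location.
        box = None if centre is None else window_around(predict(centre), half, size)
    return out
```

In this sketch `detect` stands in for watershed segmentation plus lips detection on either the whole frame or the window, and `predict` stands in for the tracking filter; keeping both as injected callables makes the fallback logic easy to exercise with stub detectors.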

[Figure 1 flowchart blocks: input image; watershed segmentation (full-size image); lips detection (full-size image); "Is lips region detected?"; H∞ prediction of lips location on the succeeding image; small window around the predicted location; subsequent incoming image; watershed segmentation (within the small window); lips detection (small-window image); "Is lips-skin ratio < threshold?"; "Is repetition > 5?"]

Figure 1. The overview of the lips detection and tracking system.

2. Watershed Lips Segmentation

The watershed algorithm is one of the popular image segmentation tools [9, 10], as it is capable of closed-boundary region segmentation. The watershed concept is inspired by topographic studies, which split the landscape into several water catchment areas. In the watershed transform, a grayscale digital image is treated as a topographic surface. Each pixel is situated at an altitude according to its gray level, where black (intensity value 0) corresponds to the minimum altitude and white (intensity value 255) represents the maximum altitude. The other pixels are distributed at levels between these two extremes.

The watershed algorithm used in this paper is based on the rain-flow simulation proposed in [11]. The algorithm applies the falling rain concept, where drops fall from the higher altitude to the minimum region (known as the catchment basin) following the steepest descent path. After the watershed, the image is divided into several catchment basins, each created by its own regional minimum. Every pixel is labeled with a specific catchment basin number as the outcome of the watershed transformation.

Although the watershed algorithm offers closed-boundary segmentation, it might nevertheless encounter the over-segmentation problem. The total number of segmented regions might increase to thousands even though only a small number of them are required. The over-segmentation is due to noise in the input image as well as the sensitivity of the watershed algorithm to intensity variations in the gradient image. To deal with this, the input image is first passed through non-linear filtering for denoising. Median filtering is chosen, as it is able to smooth out the observation noise in the image while preserving the region boundaries [12]. Subsequently, the extracted markers and the detected edges (using Sobel filtering) from the filtered image are first superimposed and only then sent for the watershed segmentation. The marker-controlled watershed only allows local minima located inside the generated markers, which reduces the redundant catchment basins caused by undesired noise.

The foreground and background markers are generated by obtaining the regional maxima and minima using the morphological techniques known as "closing-by-reconstruction" and "opening-by-reconstruction". The purpose of the morphological clean-up is to remove the undesired defects and obtain flat minima and maxima in each object.

For the rain-flow watershed algorithm implemented in this section, 8-way connectivity is applied, where each pixel is connected to its eight possible neighbours in the vertical, horizontal and diagonal directions. Each pixel points to the minimum value among the eight neighbours and is labeled according to that direction. If no neighbour holds a lower value, but some hold the same value as the current pixel, the pixel becomes part of a regional minimum. Each regional minimum forms its own catchment basin, and every pixel falls into a single minimum along the steepest descending path. The region boundaries are formed by the edges which separate the basins. The details of the rain-flow watershed algorithm can be found in [11].

After the watershed, region merging is applied to further reduce the over-segmentation by merging the catchment basins which have similar intensity values. If the ratio of the mean colour between two neighbouring regions is less than the predefined threshold, the respective regions are merged into a single region. The process is repeated until no ratio falls below the threshold. The process flow of the watershed transformation on the digital image is depicted in Figure 2.

Figure 2. The overview of the watershed lips

3. Lips Detection and Verification

After the segmented image is obtained from the previous watershed transformation process, the output is passed to the lips detection system to obtain the lips region. The watershed-segmented regions are checked against the pre-trained lips colour cluster boundary, and only a region that falls within the boundary is classified as the lips region. The overall lips detection and verification system is shown in Figure 3.
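The watershed-segmented regions referred to above come from the rain-flow labelling rule of Section 2. As a rough illustration of that rule only, the sketch below lets every pixel point to its lowest strictly lower 8-neighbour and follows the pointers down to a minimum. It is not the algorithm of [11]: the function name `rainflow_labels` is hypothetical, flat plateaus are not merged (every pixel without a lower neighbour is treated as its own minimum), and the markers and region merging are omitted.

```python
from typing import Dict, List, Tuple

# Offsets of the eight neighbours (vertical, horizontal and diagonal).
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def rainflow_labels(img: List[List[int]]) -> List[List[int]]:
    """Label every pixel with the catchment basin reached by steepest descent."""
    h, w = len(img), len(img[0])

    # Step 1: each pixel points to its lowest strictly lower neighbour,
    # or to itself if there is none (it is then treated as a minimum).
    target: Dict[Tuple[int, int], Tuple[int, int]] = {}
    for r in range(h):
        for c in range(w):
            best, best_val = (r, c), img[r][c]
            for dr, dc in NEIGHBOURS:
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w and img[rr][cc] < best_val:
                    best, best_val = (rr, cc), img[rr][cc]
            target[(r, c)] = best

    # Step 2: follow the pointers downhill; pixels that reach the same
    # minimum receive the same basin number.
    basin_of_minimum: Dict[Tuple[int, int], int] = {}
    labels = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            p = (r, c)
            while target[p] != p:
                p = target[p]
            labels[r][c] = basin_of_minimum.setdefault(p, len(basin_of_minimum))
    return labels
```

Because each pointer leads to a strictly lower value, the descent always terminates. On real images the missing plateau handling would inflate the number of basins, which is one reason the paper applies denoising, markers and region merging around this step.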

The Asian Face Database (AFD) [13] is used to train the lips colour cluster boundary. Six sets of 6x6 lips areas are cropped from every subject in the database, and a total of 642 sets of data are collected for the clustering process. The collected data is first converted from the RGB into the YCbCr domain. To avoid the influence of luminance, only the chrominance components, Cb and Cr, are used. The Cb and Cr components are plotted on the Cb-Cr graph. Only the heavily plotted pixel values are taken as the lips colour cluster; the final lips colour cluster after the morphological closing process is illustrated in Figure 4(a).

Figure 3. Lips detection and verification system.

Figure 4. (a) lips (b) skin colour clustering with cubic spline interpolant boundary.

Subsequently, the generated lips colour cluster is enclosed using the cubic spline interpolant (CSI), as formulated in (1)-(2), to create the lips colour boundary. The CSI lips colour boundary is then saved for the lips detection process, where a segmented region from the previous watershed transformation which falls within the boundary is detected as the lips region.

$$T(x) = \begin{cases} T_1(x) & \text{if } x_1 \le x \le x_2 \\ T_2(x) & \text{if } x_2 \le x \le x_3 \\ \quad\vdots \\ T_{m-1}(x) & \text{if } x_{m-1} \le x \le x_m \end{cases} \tag{1}$$

where T_k is the third-degree polynomial defined as:

$$T_k(x) = a_k (x - x_k)^3 + b_k (x - x_k)^2 + c_k (x - x_k) + d_k \tag{2}$$

for k = 1, 2, 3, …, m-1.

If more than one region is detected after the aforementioned lips detection process, a further lips verification step is triggered to obtain the final lips region. Only a detected region from the watershed transformation which also falls within the face region is taken as the final lips region. The face region is detected using a methodology similar to that of the lips detection. The skin colour cluster boundary is generated by cropping a 20x20 skin area from every subject in the AFD; the cluster after the morphological process, with its CSI boundary, is depicted in Figure 4(b).

4. Improved H∞ Based on Lyapunov Stability Theory and an Adjustable Window for Lips Tracking System

Referring to Figure 1, after the lips region is successfully detected as described in the previous section, the current lips coordinates are passed to the improved H∞ filter, which works as the predictor to estimate the lips location in the succeeding incoming image. Subsequently, a small window is placed at the predicted coordinates, and the subsequent watershed lips segmentation and detection process is focused only on the small window region, rather than on the full-size image for the entire video sequence. The full-size image screening is carried out once again if the lips region fails to be detected. With the aid of lips tracking, the image processing time of the overall lips detection system is reduced, which is an advantage for hardware implementation.

4.1 The Adjustable Window

Instead of a fixed small window, an adjustable window is applied in this section. The reason is that a fixed window can only deal with a subject who moves horizontally in front of the lips detection device. The gradually increasing lips size when the subject moves towards the device cannot be fully covered by a fixed small window, which might cause the failure of the subsequent watershed segmentation and of the detection process. The part of the lips region that falls outside the window might cause the loss of important information from the detected lips region for further analysis, such as the visual speech recognition process. The drawback of the fixed small window is illustrated in Figure 5, which shows the failure to cover the lips region entirely when the subject moves towards the detection device.

[Figure 5 panels: (a) (b) (c)]

Figure 5. The problem of fixed small window when the subject is moving forward as in (b) and (c).

4.2 The Improved H∞ for Lips Tracking System

The improved H∞ filtering [21] based on LST for the lips tracking system is elaborated below. The linear, discrete-time state and measurement equations are denoted as:

State equation:
$$x_{n+1} = A x_n + B u_n + w_n \tag{3}$$

Measurement equation:
$$y_n = C x_n + v_n \tag{4}$$

In (3) and (4), x represents the system state while y is the measured output. A is the transition matrix carrying the state value x_n from time n to n+1, B links the input vector u to the state variables, and C is the observation model that maps the true state space to the observed space; w_n and v_n are the process and measurement noise, respectively.

The state vector for the lips tracking in (3) consists of the centre coordinates of the detected lips in the horizontal and vertical directions. A new adaptation gain for the H∞ filtering is implemented based on the LST. The design concept follows [13]. According to LST, the convergence of the tracking error e(n) under the newly designed adaptation gain is guaranteed: it asymptotically converges to zero as time approaches infinity.

Theorem 4.1: Given a linear parameter vector H(n) and a desired output d(n), the state vector x(n) is updated as:

$$x(n) = x(n-1) + g(n)\,\alpha(n) \tag{5}$$

The adaptation gain of the improved H∞ filtering, which has the characteristic Δv < 0, is designed as:

$$g(n) = \frac{H(n)}{\|H(n)\|^{2}}\,\bigl[\alpha(n) - (I - L(n))\,e(n-1)\bigr]\,\frac{x(n-1)}{\|x(n-1)\|^{2}} \tag{6}$$

where

$$L(n) = \bigl[I - \gamma Q P(n) + H^{T}(n) V^{-1} H(n) P(n)\bigr]^{-1} \tag{7}$$

$$P(n) = F P(n) L(n) F^{T} + W \tag{8}$$

γ, Q, W and V are the user-defined performance bound and the weighting matrices for the estimation error, the process noise and the measurement noise, respectively.

The a priori prediction error α(n) is defined as:

$$\alpha(n) = d(n) - H^{T} x(n-1) \tag{9}$$

The tracking error e(n) asymptotically converges to zero as the time n tends to infinity.

Proof: To design a tracking system which fulfils the LST, the Lyapunov function v(n) is first defined as:

$$v(n) = e^{2}(n) \tag{10}$$

According to LST, the selected v(n) is a true Lyapunov function if and only if Δv(n) = v(n) − v(n−1) < 0 [13].

The difference between v(n) and v(n−1) is as follows:

$$\Delta v(n) = v(n) - v(n-1) = e^{2}(n) - e^{2}(n-1) \tag{11}$$

By substituting the adaptation gain from (6) into (11),

(12)

5. Simulation and Analysis

5.1 Simulation of Watershed Lips Segmentation

The AFD is used to evaluate the performance of the watershed lips segmentation. Figure 6(b) shows the over-segmentation problem when the watershed transformation is applied directly to the input image in Figure 6(a).

Figure 6. (a) Input image (b) Over-segmentation (c) Superimposition of gradient and marker images (d) Desired segmented image.

Figure 6(c) depicts the superimposition of the detected edges and the markers. The desired segmentation output, with fewer redundant segmented regions, is shown in Figure 6(d).

5.2 Simulation of Lips Detection and Verification

The resultant segmented image from the aforementioned watershed transformation is then passed to the lips detection system. The detected lips region is depicted in Figure 7. If more than one segmented region is detected as the lips region, a further verification process is triggered to retrieve the final lips region, as shown in Figure 8. Only a segmented region which fulfils the criteria of falling within the lips colour cluster boundary as well as being situated within the face region is detected as the lips region.

Figure 7. (a) Input Image (b) Watershed segmented image (c) Detected Lips Region.

Figure 8. (a) Input Image (b) Final detected lips region (c) Watershed segmented image and lips detection (d) Lips verification with skin colour clustering

5.3 Simulation of H∞ Lips Tracking

After the lips region is successfully retrieved as described in the previous section, the location of the current detected lips region is passed to the lips tracking system, which is analysed in this section, to predict the lips location in the subsequent incoming image.
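To make the Lyapunov argument behind the tracker concrete, the sketch below runs the update (5) with the a priori error (9) for a scalar desired output. The gain used here is a simplified stand-in and is not the gain in (6)-(8): it is an assumption chosen so that the tracking error satisfies e(n) = κ e(n−1) for a fixed constant κ with |κ| < 1, which gives Δv(n) = e²(n) − e²(n−1) = (κ² − 1) e²(n−1) < 0. The names `lyapunov_track`, `dot` and `kappa` are hypothetical.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(p * q for p, q in zip(a, b))


def lyapunov_track(
    H: Sequence[float], x0: Sequence[float],
    desired: Sequence[float], kappa: float = 0.5,
) -> Tuple[List[float], List[float]]:
    """Track a scalar desired output d(n); return the final state and errors e(n)."""
    x = list(x0)
    norm_sq = dot(H, H)
    errors: List[float] = []
    e_prev = None
    for d in desired:
        alpha = d - dot(H, x)      # a priori prediction error, as in (9)
        if e_prev is None:
            e_prev = alpha         # no earlier error exists on the first step
        if alpha != 0.0:
            # Assumed gain: g = H / ||H||^2 * (alpha - kappa * e_prev) / alpha
            scale = (alpha - kappa * e_prev) / (alpha * norm_sq)
            g = [h * scale for h in H]
            x = [xi + gi * alpha for xi, gi in zip(x, g)]   # update (5)
        e = d - dot(H, x)          # tracking error after the update
        errors.append(e)
        e_prev = e
    return x, errors
```

With this choice the error contracts by the factor κ at every step regardless of how d(n) changes, so v(n) = e²(n) decreases monotonically; the paper's design replaces the constant κ with the term (I − L(n)) derived from the H∞ recursion.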
To evaluate the tracking capability of the implemented improved H∞ filtering, some in-house video clips were prepared using a Canon IXUS-65 camera. The video sequence is first converted into 256x256 images. Figure 9 shows some of the tracked lips locations from the image sequences. Table 1 shows the average estimation error of the lips tracking system for the in-house video sequence on every 5th frame and every 10th frame.

Table 1: Average estimation error

Average estimation error   Improved H∞                Conventional H∞
(no. of pixels)            Every 5th    Every 10th    Every 5th    Every 10th
                           frame        frame         frame        frame
y-position                 2.61         4.87          3.35         6.14
x-position                 7.08         14.50         9.28         18.29

Figure 9. (a) The first input image (b) watershed segmentation on the full size image (c) detected lips region

6. Conclusion

A lips detection system based on the closed-boundary watershed and an improved H∞ lips tracking system are presented in this paper. The proposed system enables direct lips detection without the preliminary face localization procedure. The watershed algorithm, which offers closed-boundary segmentation, gives better information for further visual analysis. Subsequently, the improved H∞ filtering is implemented to keep track of the lips location in the succeeding incoming video frames. Compared to the conventional H∞, the improved H∞, which fulfils the LST, shows a better tracking capability. With the aid of the tracking system and the adjustable small window, the overall image processing time could be reduced, since only a small window of the image is processed to obtain the lips region instead of the full frame throughout the entire video sequence. The overall proposed system could then be implemented in an audio-visual speech authentication system in the future.

References

[1] N. Bi, et al., "Robust image watermarking based on multiband wavelets and empirical mode decomposition," IEEE Transactions on Image Processing, vol. 16, pp. 1956-1966, 2007.
[2] S. Dutta, et al., "Network Security Using Biometric and Cryptography," in Advanced Concepts for Intelligent Vision Systems, 2008, pp. 38-44.
[3] V. K. Aggithaya, et al., "A multimodal biometric authentication system based on 2D and 3D palmprint features," in Biometric Technology for Human Identification V, Orlando, FL, USA, 2008, pp. 69440C-9.
[4] G. Chetty and M. Wagner, "Robust face-voice based speaker identity verification using multilevel fusion," Image and Vision Computing, vol. 26, pp. 1249-1260, 2008.
[5] J. A. Dargham and A. Chekima, "Lips Detection in the Normalised RGB Colour Scheme," in Information and Communication Technologies, 2006. ICTTA '06. 2nd, 2006, pp. 1546-1551.
[6] B. Goswami, et al., "Statistical estimators for use in automatic lip segmentation," in Visual Media Production, 2006. CVMP 2006. 3rd European Conference on, 2006, pp. 79-86.
[7] T. W. Lewis and D. M. W. Powers, "Audio-visual speech recognition using red exclusion and neural networks," Journal of Research and Practice in Information Technology, vol. 35, pp. 41-64, 2003.
[8] S. Kah Phooi, et al., "Lyapunov-theory-based radial basis function networks for adaptive filtering," Circuits and Systems I: Fundamental Theory and Applications, IEEE Transactions on, vol. 49, pp. 1215-1220, 2002.
[9] J. Cousty, et al., "Watershed Cuts: Minimum Spanning Forests and the Drop of Water Principle," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 31, pp. 1362-1374, 2009.
[10] E. Hodneland, et al., "Four-Color Theorem and Level Set Methods for Watershed Segmentation," International Journal of Computer Vision, vol. 82, pp. 264-283, 2009.
[11] V. Osma-Ruiz, et al., "An improved watershed algorithm based on efficient computation of shortest paths," Pattern Recognition, vol. 40, pp. 1078-1090, 2007.
[12] N. Gallagher, Jr. and G. Wise, "A theoretical analysis of the properties of median filters," Acoustics, Speech and Signal Processing, IEEE Transactions on, vol. 29, pp. 1136-1141, 1981.
[13] I. M. Lab, "Asian Face Image Database PF01," Pohang University of Science and Technology.

Authors Profile

Siew Wen Chin received her MSc and Bachelor degrees from the University of Nottingham Malaysia Campus in 2008 and 2006 respectively. She is currently pursuing her Ph.D at the same campus. Her research interests are in the fields of image and vision processing, multi-biometrics and signal processing.

Kah Phooi Seng received her Ph.D and
Bachelor degrees from the University of
Tasmania, Australia in 2001 and 1997
respectively. She is currently an Associate
Professor at the University of Nottingham
Malaysia Campus. Her research interests are
in the fields of intelligent visual processing,
biometrics and multi-biometrics, artificial
intelligence, and signal processing.

Li-Minn Ang received his Ph.D and Bachelor
degrees from Edith Cowan University,
Australia in 2001 and 1996 respectively. He is
currently an Associate Professor at the
University of Nottingham Malaysia Campus.
His research interests are in the fields of
signal, image, vision processing, intelligent
processing techniques, hardware architectures,
and reconfigurable computing.

King Hann Lim received his Master of
Engineering from the University of
Nottingham, Malaysia campus in 2007. He is
currently doing his Ph.D at the same
University. He is a member of the Visual
Information Engineering Research Group. His
research interests are in the fields of signal,
image, vision processing, intelligent
processing techniques, and computer vision for intelligent ve