This paper describes a voiceless speech recognition
technique that utilizes dynamic visual features to represent the facial movements during phonation. The
dynamic features extracted from the mouth video are
used to classify utterances without using any acoustic data. The acoustic signals of consonants are more easily confused than those of vowels, whereas the facial movements involved in pronouncing consonants are more discernible. This paper therefore focuses on identifying consonants from visual information. It adopts
a visual speech model that categorizes utterances into
sequences of the smallest visually distinguishable units,
known as visemes. The viseme model used is based on
that of the Moving Picture Experts Group 4 (MPEG-4)
standard. The facial movements are
segmented from the video data using motion history
images (MHI). An MHI is a spatio-temporal template
(a grayscale image) generated from the video data using
an accumulative image subtraction technique. The proposed approach combines the discrete stationary wavelet
transform (SWT) and Zernike moments to extract rotation invariant features from the MHI. A feedforward
multilayer perceptron (MLP) neural network is used
to classify the features based on the patterns of visible
facial movements. The preliminary experimental results indicate that the proposed technique is suitable
for recognition of English consonants.
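As a rough illustration of the accumulative image subtraction behind an MHI, the sketch below implements the standard update rule (pixels that just moved are stamped with a timestamp value, while older motion decays toward zero). The frame sizes, threshold, and decay value are illustrative assumptions, not parameters taken from the paper:

```python
import numpy as np

def update_mhi(mhi, prev_frame, frame, tau=10, threshold=30):
    """One MHI update step via accumulative image subtraction.

    Pixels whose intensity changed by more than `threshold` between
    consecutive frames are set to `tau` (most recent motion); all
    other pixels decay by 1, so older motion fades toward 0.
    """
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    motion = diff > threshold
    decayed = np.maximum(mhi - 1, 0)
    return np.where(motion, tau, decayed)

# Toy demo: a bright block "moves" one pixel to the right per frame,
# standing in for the moving mouth region in the video data.
frames = []
for t in range(5):
    f = np.zeros((8, 8), dtype=np.uint8)
    f[2:5, t:t + 3] = 255
    frames.append(f)

mhi = np.zeros((8, 8), dtype=np.int16)
for prev, cur in zip(frames, frames[1:]):
    mhi = update_mhi(mhi, prev, cur)

# The resulting grayscale template encodes recency of motion:
# larger values mark more recently moved pixels.
print(mhi.max())  # most recent motion pixels hold tau (10)
```

In the paper's pipeline, a template like this would then be passed to the SWT and Zernike-moment feature-extraction stage rather than used directly.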