

A Practical Model for Live Speech Driven Lip-Sync

Li Wei and Zhigang Deng, Senior Member, IEEE

Abstract—In this article we introduce a simple, efficient, yet practical phoneme-based approach to generate realistic speech animation in real-time based on live speech input. Specifically, we first decompose lower-face movements into low-dimensional principal component spaces. Then, in each of the retained principal component spaces, we select the AnimPho with the highest priority value and the minimum smoothness energy. Finally, we apply motion blending and interpolation techniques to compute the final animation frames for the currently inputted phoneme. Through many experiments and comparisons, we demonstrate the realism of the speech animation synthesized by our approach as well as its real-time efficiency on an off-the-shelf computer.

Index Terms—Facial animation, speech animation, live speech driven, talking avatars, virtual humans, data-driven

• L. Wei and Z. Deng are with the Computer Graphics and Interactive Media Lab and the Department of Computer Science, University of Houston, Houston, TX 77204-3010.
• E-mail: zdeng4@uh.edu.

1 INTRODUCTION

In the signal processing and speech understanding communities, several approaches have been proposed to generate speech animation based on live acoustic speech input. For example, based on a real-time recognized phoneme sequence, researchers use simple linear smoothing functions to produce corresponding speech animation [1], [2]. Meanwhile, a number of approaches train statistical models (e.g., neural networks) to encode the mapping between acoustic speech features and facial movements [3], [4]. These approaches have demonstrated their real-time runtime efficiency on an off-the-shelf computer; however, their performance is highly speaker-dependent due to the individual-specific nature of the chosen acoustic speech features. Furthermore, due to their insufficient visual realism, these approaches are in practice less suitable for graphics and animation applications.

Challenges of live speech driven lip-sync. First, live speech driven lip-sync imposes additional technical challenges compared with the off-line case, where expensive global optimization techniques can be employed to solve for the most plausible speech motion corresponding to novel spoken or typed input. In contrast, it is extremely difficult, if not impossible, to directly apply such global optimization techniques to live speech driven lip-sync applications, since the forthcoming (not yet available) speech content cannot be exploited during the synthesis process. Second, live speech driven lip-sync algorithms need to be highly efficient to ensure a real-time speed of generating speech animation on an off-the-shelf computer, while off-line speech animation synthesis algorithms do not need to meet such a tight constraint. The last challenge, compared with the case of forced phoneme alignment for pre-recorded speech, comes from the low accuracy of state-of-the-art live speech phoneme recognition systems (e.g., the Julius system (http://julius.sourceforge.jp) or the HTK toolkit (http://htk.eng.cam.ac.uk)).

In order to quantify the phoneme recognition accuracy difference between the pre-recorded speech and live speech cases, we performed an empirical study as follows. We randomly selected 10 pre-recorded sentences and extracted their phoneme sequences using the following two different approaches: (1) the Julius system was used to perform forced phoneme alignment on the 10 pre-recorded speech clips (called offline phoneme-alignment), and (2) the same Julius system was used as a real-time phoneme recognition engine. In other words, by simulating the same pre-recorded speech clip as live speech, it outputted phonemes sequentially while the speech was being fed into the system. Then, by taking the offline phoneme-alignment results as the ground truth, we can compute the accuracies of the live speech phoneme recognition in our experiment. As illustrated in Figure 1, the live speech phoneme recognition accuracy of the same Julius system varies from 45% to 80%. Further empirical analysis did not show any patterns of incorrectly recognized phonemes (that is, which phonemes are often recognized incorrectly in the live speech case). This empirical finding implies that, in order to produce satisfactory live speech driven speech animation results, any phoneme-based algorithm must take the relatively low phoneme recognition accuracy (in the case of live speech) into design consideration, and it should be able to perform a certain amount of self-correction at runtime, since some phonemes could be incorrectly recognized and inputted into the algorithm in a less predictable manner.

Fig. 1. Illustration of the calculated live speech phoneme recognition accuracies of the Julius system, compared to its offline phoneme-alignment results.
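For illustration, the following minimal C++ sketch computes such an accuracy figure by comparing a live-recognized phoneme sequence against the offline forced alignment taken as the ground truth. It is only a sketch under stated assumptions: the frame-level (10 ms) comparison, as well as the PhonemeInterval, labelAt, and phonemeAccuracy names, are illustrative choices, not the exact metric or code used in our study.

// Hedged sketch: frame-level phoneme recognition accuracy.
// Each recognizer output is assumed to be a list of (phoneme, start, end) intervals.
#include <string>
#include <vector>

struct PhonemeInterval {
    std::string label;
    double start;   // seconds
    double end;     // seconds
};

// Return the phoneme label covering time t, or "" if none.
static std::string labelAt(const std::vector<PhonemeInterval>& seq, double t) {
    for (const auto& p : seq)
        if (t >= p.start && t < p.end) return p.label;
    return "";
}

// Fraction of sampled time points where the live result matches the
// offline forced alignment (taken as ground truth).
double phonemeAccuracy(const std::vector<PhonemeInterval>& offline,
                       const std::vector<PhonemeInterval>& live,
                       double step = 0.01 /* 10 ms */) {
    if (offline.empty()) return 0.0;
    double duration = offline.back().end;
    int total = 0, correct = 0;
    for (double t = 0.0; t < duration; t += step) {
        ++total;
        if (labelAt(offline, t) == labelAt(live, t)) ++correct;
    }
    return total > 0 ? static_cast<double>(correct) / total : 0.0;
}

A stricter variant could align the two label sequences with an edit-distance alignment instead of sampling them on a fixed time grid; either way, the offline alignment serves as the reference.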

Inspired by the above research challenges, in this paper we propose a practical phoneme-based approach for live speech driven lip-sync. Besides generating realistic speech animation in real-time, our phoneme-based approach can straightforwardly handle speech input from different speakers, which is one of the major advantages of phoneme-based approaches over acoustic speech feature driven approaches [3], [4]. Specifically, we introduce an efficient, simple algorithm to compute the priority of each motion segment and to select the plausible segment based on the phoneme information that is sequentially recognized at runtime. Through our experiments and various comparisons, we evaluate the accuracy and efficiency of our live speech driven lip-sync approach. Compared to existing lip-sync approaches (reviewed in Section 2), the main advantages of our approach are its efficiency, simplicity, practicalness, and capability of handling live speech input in real-time.

2 RELATED WORK

Visual speech animation synthesis. The essential part of visual speech animation synthesis is how to model the speech co-articulation effect. Conventional viseme driven approaches need users to first carefully design a set of visemes and then employ interpolation functions [5] or co-articulation rules to compute in-between frames for speech animation synthesis [6]. However, a fixed phoneme-viseme mapping scheme is often insufficient to model the co-articulation phenomenon in human speech production. As such, the resulting speech animations often lack variance and realistic articulation.

Instead of interpolating a set of pre-designed visemes, one category of data-driven approaches generates speech animations by optimally selecting and concatenating motion units from a pre-collected database based on various cost functions [7], [8], [9], [10], [11]. The second category of data-driven speech animation approaches learns statistical models from data [12], [13]. All these data-driven approaches have demonstrated noticeable successes for offline, pre-recorded speech animation synthesis. Unfortunately, they cannot be straightforwardly extended for live speech driven lip-sync. The main reason is that they typically utilize some form of global optimization technique to synthesize the most optimal speech-synchronized facial motion. In contrast, in the case of live speech driven lip-sync, forthcoming speech (or phoneme) information is simply unavailable; as such, directly applying such global optimization based algorithms would be technically infeasible.

Live speech driven facial and gesture animation. A number of algorithms have been proposed to generate live speech driven speech animations. Besides pre-designed phoneme-viseme mapping [1], [2], various statistical models (e.g., neural networks [3], nearest-neighbor search [4], and others) have been trained to learn the audio-visual mapping for this purpose. The common disadvantage of these approaches is that the acoustic features used for training these statistical audio-to-visual mapping models are highly speaker-specific, as pointed out by Taylor et al. [10].

3 DATA ACQUISITION AND PREPROCESSING

We acquired a training facial motion dataset for this work using an optical motion capture system. We attached over 100 markers on the face of a female native English speaker. 3D rigid head motion was eliminated using an SVD-based statistical shape analysis method [9]. The captured subject was guided to speak a phoneme-balanced corpus consisting of 166 sentences with a neutral expression. The obtained dataset contains 72,871 motion frames (about 10 minutes of recording, 120 frames per second). Phoneme labels and the durations of the phonemes were automatically extracted from the simultaneously recorded speech data.

As illustrated in Figure 3(a), 39 markers in the lower face region are used in this work, which results in a 117-dimensional feature vector for each motion frame. We apply Principal Component Analysis (PCA) to reduce the dimension of the motion feature vectors, which allows us to obtain a compact representation by only retaining a small number of principal components. In this work, we keep the five most significant principal components to cover 96.6% of the variance of the motion dataset. It is noteworthy that all the other processing steps in this writing are performed in parallel in each of the five retained principal component spaces.

Motion segmentation: For each recorded sentence, based on its phoneme alignment result, we segment its motion sequence accordingly and extract a motion segment for each phoneme occurrence. For the purpose of motion blending later, we keep an overlapping region between two neighboring motion segments. Currently, we set the length of the overlapping region to one frame, as illustrated in Figure 2.

Fig. 2. Illustration of motion segmentation. The grids represent motion frames. s_i denotes the motion segment corresponding to phoneme p_i. When we segment a motion sequence based on its phoneme timing information, we keep one overlapping frame (the grids with slash fill) between two neighboring segments. We evenly sample five frames (the grids in red) to represent a motion segment.

Motion segment normalization: Because the duration of a phoneme is typically short (in our dataset, the average duration of a phoneme is 109 milliseconds), we can use a small number of evenly sampled frames to represent the original motion. In our approach, we down-sample a motion segment by evenly selecting five representative frames (see Figure 2). To this end, we represent each motion segment s_i as a vector v_i that concatenates the PCA coefficients of the five sampled frames.
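For illustration, the following minimal C++ sketch shows one way to carry out this normalization: each per-phoneme segment is evenly resampled to five frames, whose retained PCA coefficients are concatenated into the vector v_i. The Frame/Segment type names and the nearest-frame resampling rule are assumptions, and for brevity the five retained coefficients are handled together, whereas the processing described above is performed independently in each retained principal component space.

// Hedged sketch of motion segment normalization (not the authors' code).
#include <vector>

using Frame = std::vector<double>;          // retained PCA coefficients of one mocap frame
using Segment = std::vector<Frame>;         // frames belonging to one phoneme occurrence

// Evenly sample `count` representative frames from a segment (count = 5 in this work).
Segment downsample(const Segment& seg, int count = 5) {
    Segment out;
    if (seg.empty()) return out;
    for (int i = 0; i < count; ++i) {
        // Map i in [0, count-1] onto [0, seg.size()-1] and take the nearest frame.
        double pos = (count > 1) ? static_cast<double>(i) * (seg.size() - 1) / (count - 1) : 0.0;
        out.push_back(seg[static_cast<size_t>(pos + 0.5)]);
    }
    return out;
}

// Concatenate the PCA coefficients of the sampled frames into the vector v_i.
std::vector<double> toFeatureVector(const Segment& seg) {
    std::vector<double> v;
    for (const Frame& f : downsample(seg))
        v.insert(v.end(), f.begin(), f.end());
    return v;                               // five sample frames, concatenated
}

These fixed-length vectors are what the clustering step in the next section operates on.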

Clustering: We apply the K-means clustering algorithm to the motion vector dataset {v_i : i = 1...n}, where n is the total number of motion vectors. The Euclidean distance is used to compute the distance between two motion vectors. The center of each obtained motion cluster is called an AnimPho in this writing. Finally, we obtain a set of AnimPhos {a_i : i = 1...m}, where m is the number of AnimPhos and a_i is the i-th AnimPho.

After the above clustering step, we essentially convert the motion dataset to a set of AnimPho sequences. Then, we further pre-compute the following two data structures:

• AnimPho transition table, T. It is an m × m table, where m is the number of AnimPhos. T(a_i, a_j) = 1 if the AnimPho transition <a_i, a_j> exists in the motion dataset; otherwise T(a_i, a_j) = 0. In the remainder of this paper, we say two AnimPhos a_i and a_j are connected if and only if T(a_i, a_j) = 1. We also define T(a_i) = {a_j | T(a_i, a_j) = 1} as the set of AnimPhos that are connected to a_i.

• Phoneme-AnimPho mapping table, L. It is an h × m table, where h is the number of used phonemes (43 English phonemes are used in this work). L(p_i, a_j) = 1 if the phoneme label of a_j is phoneme p_i in the dataset; otherwise L(p_i, a_j) = 0. Similarly, we define L(p_i) = {a_j | L(p_i, a_j) = 1} as the set of AnimPhos that have the phoneme label p_i.

Fig. 3. Illustration of the used facial motion dataset. (a) Among the 102 markers, the 39 green markers are used in this work. (b) Illustration of the average face markers.

4 LIVE SPEECH DRIVEN LIP-SYNC

In this work, two steps are used to generate lip motion based on live speech input: AnimPho selection and motion blending. We describe the two steps in detail in this section.

4.1 AnimPho Selection

Assuming phoneme p_t arrives at the lip-sync system at time t, our algorithm needs to find its corresponding AnimPho, a_t, to animate p_t. A naive solution would be to take the AnimPho set φ(p_t) = L(p_t) ∩ T(a_{t−1}), which is the set of AnimPhos that have the phoneme label p_t and have a natural connection with a_{t−1}, and choose one of them as a_t by minimizing a cost function (e.g., the smoothness energy defined in Eq. 1). If φ(p_t) is empty (called a miss in this paper), then we reassign all the AnimPhos (regardless of their phoneme labels) as φ(p_t):

a_t = argmin_{η ∈ φ(p_t)} dist(E(a_{t−1}), B(η)),   (1)

where dist(E(a_{t−1}), B(η)) returns the Euclidean distance between the ending frame of a_{t−1} and the starting frame of η. If L(p_t) ∩ T(a_{t−1}) is not empty, this method can ensure the co-articulation by finding the AnimPho a_t that is naturally connected to a_{t−1} in the captured dataset. However, it does not work well in practical applications. The main reason is that the missing rate of this method is typically very high due to the limited size of the captured dataset; as a result, the synthesized animations are often incorrectly articulated and over-smoothed (i.e., they only minimize the smoothness energy defined in Eq. 1).

Fig. 4. An illustrative example of the naive AnimPho selection strategy: p1 to p3 on the top are the phonemes that sequentially come to the system. The squares below each phoneme p_i are L(p_i), and the number in a square denotes the index of an AnimPho. Dotted directional lines denote existing AnimPho transitions in the dataset. The AnimPhos in green are the ones selected by the naive AnimPho selection strategy.

A simple example of this naive AnimPho selection strategy is illustrated in Figure 4. Assume we initially select the AnimPho #3 for phoneme p1. When p2 comes, we find that none of the AnimPhos in L(p2) can connect to the AnimPho #3 in the dataset, so we select the one with the minimum smoothness energy (assuming it is the AnimPho #6). When p3 comes, again we find that none of the AnimPhos in L(p3) can connect to the AnimPho #6, so we select an AnimPho (assuming it is the AnimPho #7) based on the smoothness energy. The missing rate of this simple example would be 100%, and the natural AnimPho connection information (the dotted directional lines in Figure 4) is not exploited in this example.
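For illustration, the following minimal C++ sketch shows concrete forms of the two pre-computed tables and of the naive selection of Eq. 1, including the fallback on a miss. The AnimPho, AnimPhoDB, dist, and selectNaive names are illustrative assumptions, not identifiers from our implementation, and the cluster center of each AnimPho is assumed to be stored as its five sample frames.

// Hedged sketch of the AnimPho database and the naive selection of Eq. 1.
#include <cmath>
#include <limits>
#include <set>
#include <vector>

// One AnimPho: the center of a motion-segment cluster, stored as its five
// evenly sampled frames of retained PCA coefficients (see Section 3).
struct AnimPho {
    std::vector<std::vector<double>> frames;               // 5 sample frames
    const std::vector<double>& startFrame() const { return frames.front(); }  // B(a)
    const std::vector<double>& endFrame()   const { return frames.back(); }   // E(a)
};

struct AnimPhoDB {
    std::vector<AnimPho> animphos;
    std::vector<std::set<int>> T;     // T[i]: AnimPhos connected to AnimPho i
    std::vector<std::set<int>> L;     // L[p]: AnimPhos carrying phoneme label p
};

static double dist(const std::vector<double>& x, const std::vector<double>& y) {
    double s = 0.0;
    for (size_t i = 0; i < x.size(); ++i) s += (x[i] - y[i]) * (x[i] - y[i]);
    return std::sqrt(s);
}

// Naive selection of Eq. 1: candidates are L(p_t) ∩ T(a_{t-1}); on a miss,
// every AnimPho becomes a candidate, and the smoothest transition wins.
int selectNaive(const AnimPhoDB& db, int prev, int phoneme) {
    std::vector<int> candidates;
    for (int a : db.L[phoneme])
        if (db.T[prev].count(a)) candidates.push_back(a);
    if (candidates.empty())                                  // a "miss"
        for (int a = 0; a < (int)db.animphos.size(); ++a) candidates.push_back(a);

    int best = candidates.front();
    double bestCost = std::numeric_limits<double>::max();
    for (int a : candidates) {
        double c = dist(db.animphos[prev].endFrame(), db.animphos[a].startFrame());
        if (c < bestCost) { bestCost = c; best = a; }
    }
    return best;
}

With a small dataset such a miss is frequent, which is exactly the situation the priority value introduced next is designed to mitigate.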

To solve both the over-smoothness and inaccurate articulation issues, we need to properly utilize the AnimPho connection information, especially when the missing rate is high. In our approach, we compute a priority value, v_i, for each AnimPho a_i ∈ L(p_t) based on the pre-computed transition information. Specifically, the priority value of an AnimPho a_i is defined as the length of the longest connected AnimPho sequence corresponding to the observed phoneme sequence (up to p_t). The priority value v_i of a_i is calculated as follows:

v_i = max(v_j + 1) if ∃ a_j ∈ L(p_{t−1}) with a_i ∈ T(a_j); otherwise v_i = 1,   (2)

where the maximum is taken over all such a_j.

Then, we select the AnimPho a_t by taking both the priority value and the smoothness term into consideration. In our approach, we prefer to select the AnimPho with the highest priority value; if multiple AnimPhos share the same highest priority value, we select the one that connects most smoothly with a_{t−1}.

Fig. 5. An illustrative example of our priority-incorporated AnimPho selection strategy: the number in a red box is the priority value of an AnimPho.

We use the same illustrative example in Figure 5 to explain our algorithm. We initially select the AnimPho #3 for p1. When p2 comes, we first compute the priority value for each AnimPho in L(p2) using the AnimPho transition information. The priority value of the AnimPho #4, v4, is 2, since the AnimPho sequence <1, 4> corresponds to the observed phoneme sequence <p1, p2>. Similarly, we can compute the priority value of the AnimPho #5 as 2. After we compute the priority values, we select one AnimPho for p2. Although we find that none of the AnimPhos in L(p2) is connected to the AnimPho #3, instead of selecting the AnimPho #6 that minimizes the smoothness term, we choose the AnimPho #4 or the AnimPho #5 (both have the same, higher priority value). Assuming the AnimPho #5 has a smaller smoothness energy than the AnimPho #4, we select the AnimPho #5 for p2. When phoneme p3 comes, the AnimPho #8 has the highest priority value, so we simply select the AnimPho #8 for p3.

As shown in Figure 6, compared with the naive AnimPho selection strategy, our approach can significantly reduce the missing rate. Still, the missing rate of our approach is substantial (e.g., around 50%), due to the following two reasons:

• Unpredictability of the next phoneme. At the AnimPho selection step of live speech driven lip-sync, the next (forthcoming) phoneme information is unavailable, so we cannot predict which AnimPho, as the best selection for the current step, would provide the optimal AnimPho transition to the next phoneme.

• The limited size of the motion dataset. In most practical applications, the acquired facial motion dataset typically has a limited size due to data acquisition cost and other reasons. For example, the average number of outgoing transitions from an AnimPho in our dataset is 3.45. Therefore, once we pick an AnimPho for p_{t−1}, the forward AnimPho transitions for p_t are limited (i.e., 3.45 on average).

Fig. 6. The missing rate comparison of 8 sentences between the naive AnimPho selection strategy and our approach.

In theory, in order to dramatically reduce the missing rate (i.e., to bring it close to zero), the above two issues need to be well taken care of. In this work, rather than capturing a much larger facial motion dataset to reduce the missing rate, we focus on how to utilize a limited motion dataset to produce acceptable live speech driven lip-sync results. In the following section, we will introduce a motion blending step to handle the motion discontinuity incurred by a missing event.

4.2 Motion Blending

In this section, we describe how to generate the final facial animation using a motion blending technique, based on the selected AnimPho a_t (refer to Section 4.1) and the duration of phoneme p_t (outputted from a live speech phoneme recognition system such as Julius). Specifically, this technique consists of three steps: (1) select an AnimPho b_t that both connects to a_{t−1} and has the most similar motion trajectory with a_t; (2) generate an intermediate motion segment m_t based on a_{t−1}, a_t, and b_t; (3) perform motion interpolation on m_t to produce the desired number of animation frames for p_t, based on the inputted duration of p_t.

Selection of b_t. As mentioned previously, at runtime our live speech driven lip-sync approach still has a non-trivial missing rate (e.g., on average about 50%) at the AnimPho selection step. To solve the possible motion discontinuity issue between a_{t−1} and a_t, we need to identify and utilize an existing AnimPho transition <a_{t−1}, b_t> in the dataset such that b_t has the most similar motion trajectory with a_t. Our rationale is that the starting frame of b_t can ensure a smooth connection with the ending frame of a_{t−1}, and b_t can also have a similar motion trajectory with a_t at most places. We identify the AnimPho b_t by minimizing the following cost function:

b_t = argmin_{η ∈ T(a_{t−1})} dist(η, a_t),   (3)

where dist(η, a_t) returns the Euclidean distance between a_t and η. Note that if a_t ∈ T(a_{t−1}), which means there exists a natural AnimPho transition from a_{t−1} to a_t in the dataset, Eq. 3 is guaranteed to return b_t = a_t.
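For illustration, the following minimal C++ sketch (reusing the AnimPhoDB, AnimPho, and dist definitions from the previous sketch) implements the priority computation of Eq. 2, the priority-then-smoothness selection rule, and the choice of b_t in Eq. 3. The function names, the cached prevPriorities map, and the use of the stored sample frames as the "motion trajectory" in Eq. 3 are illustrative assumptions rather than details given above.

// Builds on the AnimPhoDB / AnimPho / dist sketch shown earlier.
#include <algorithm>
#include <limits>
#include <map>

struct Selection {
    int animpho = -1;                 // chosen a_t (-1 only if L(p_t) is empty)
    std::map<int, int> priorities;    // v_i for every a_i in L(p_t), kept for the next step
};

// Priority-incorporated selection: Eq. 2 plus the highest-priority / smoothest rule.
// prevPriorities caches the v_j values computed for L(p_{t-1}) at the previous step.
Selection selectWithPriority(const AnimPhoDB& db, int prev, int phonemePrev, int phoneme,
                             const std::map<int, int>& prevPriorities) {
    Selection out;
    for (int ai : db.L[phoneme]) {
        int v = 1;                                            // Eq. 2, "otherwise" branch
        for (int aj : db.L[phonemePrev]) {
            auto it = prevPriorities.find(aj);
            if (it != prevPriorities.end() && db.T[aj].count(ai))
                v = std::max(v, it->second + 1);              // Eq. 2, max(v_j + 1)
        }
        out.priorities[ai] = v;
    }
    int bestV = 0;
    double bestCost = std::numeric_limits<double>::max();
    for (const auto& [ai, v] : out.priorities) {
        double c = dist(db.animphos[prev].endFrame(), db.animphos[ai].startFrame());
        if (v > bestV || (v == bestV && c < bestCost)) { bestV = v; bestCost = c; out.animpho = ai; }
    }
    return out;
}

// Distance between two AnimPhos over all five stored sample frames (used for Eq. 3).
static double segmentDist(const AnimPho& a, const AnimPho& b) {
    double s = 0.0;
    for (size_t f = 0; f < a.frames.size(); ++f) s += dist(a.frames[f], b.frames[f]);
    return s;
}

// Eq. 3: pick b_t among T(a_{t-1}) with the most similar motion trajectory to a_t.
// If a_t is itself connected to a_{t-1}, its distance is zero and it is returned.
int selectBt(const AnimPhoDB& db, int prev, int at) {
    int best = at;
    double bestCost = std::numeric_limits<double>::max();
    for (int eta : db.T[prev]) {
        double c = segmentDist(db.animphos[eta], db.animphos[at]);
        if (c < bestCost) { bestCost = c; best = eta; }
    }
    return best;
}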


Generation of an intermediate motion segment, m_t. After b_t is identified, we first need to modify b_t to ensure the motion continuity between the previous animation segment, a_{t−1}, and the animation for p_t that we are going to synthesize. In this writing, we use b_t' to denote the modified version of b_t. Then, we need to generate an intermediate motion segment m_t by linearly blending b_t' and a_t.

All the AnimPhos (including b_t, a_{t−1}, and a_t) store the centers of the corresponding motion segment clusters, as described in Section 3; they do not retain the original motion trajectories in the dataset. As a result, the starting frame of b_t and the ending frame of a_{t−1} may be slightly mismatched even though the AnimPho transition <a_{t−1}, b_t> indeed exists in the captured motion dataset (refer to the top panel of Figure 7). To make the starting frame of b_t perfectly match the ending frame of a_{t−1}, we adjust b_t (via a translation transformation) by setting the starting frame of b_t equal to the ending frame of a_{t−1} (refer to the middle panel of Figure 7).

Then, we linearly blend the two motion segments, b_t' and a_t, to obtain an intermediate motion segment m_t, as illustrated in the bottom panel of Figure 7. As described in Section 3, we use 5 evenly sampled frames to represent a motion segment. Therefore, this linear blending can be described using the following Equation 4:

m_t[i] = (1 − (i−1)/4) × b_t'[i] + ((i−1)/4) × a_t[i],  (1 ≤ i ≤ 5),   (4)

where m_t[i] is the i-th frame of the motion segment m_t.

Fig. 7. Illustration of the motion blending and interpolation process. Top: after the AnimPho selection step, we can see that either a_t or b_t has a smooth transition from the previous animation segment (denoted as the black solid line). Middle: we set the starting frame of b_t equivalent to the ending frame of the previous animation segment (for phoneme p_{t−1}). Bottom: we linearly blend a_t and b_t' to obtain a new motion segment m_t and further compute the derivatives of m_t. Finally, we apply Hermite interpolation to obtain in-between frames.

Motion interpolation. Taking the down-sampled motion segment m_t as input, we use an interpolation method to compute the final animation frames for phoneme p_t based on its duration. We calculate the required number of frames N in the final animation as N = duration(p_t) × FPS, where duration(p_t) denotes the duration of phoneme p_t and FPS is the frame rate of the resulting animation (e.g., 30 FPS).

In this work, we use Hermite interpolation to compute the in-between frames between two sample frames. The derivatives of the five sample frames are computed as follows:

d_t[i] = d_{t−1}[5] if i = 1;  d_t[i] = (m_t[i+1] − m_t[i−1]) / 2 if 1 < i < 5;  d_t[i] = m_t[i] − m_t[i−1] if i = 5,   (5)

where d_{t−1} denotes the derivatives of the previous down-sampled motion segment. As illustrated in the bottom panel of Figure 7, Hermite interpolation can guarantee the C1 smoothness of the resulting animation. Finally, we use m_t and d_t to interpolate the in-between animation frames. The motion interpolation procedure is described as pseudocode in Algorithm 1, where the function p = HermiteInterpolation(upper, lower, α) takes the upper and lower sample frames as input and computes the in-between frame corresponding to α ∈ [0, 1].

Algorithm 1 Motion Interpolation
Input: The sample frames, m_t; the derivatives of the sample frames, d_t; the target number of frames, N.
Output: The interpolated sequence, S_t.
1: for i = 1 → N do
2:   Find the upper and lower sample frames enclosing i: [lower, upper] = Bound(i)
3:   α = (i − lower) / (upper − lower)
4:   S_t[i] = HermiteInterpolation(upper, lower, α)
5: end for
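For illustration, the following minimal C++ sketch ties together the steps of Section 4.2: translating b_t to obtain b_t', the linear blending of Eq. 4, the derivatives of Eq. 5, and the Hermite interpolation of Algorithm 1. The standard cubic Hermite basis and the Bound-style index mapping are assumptions, since their exact forms are not spelled out above; all type and function names are illustrative.

// Hedged sketch of the motion blending and interpolation pipeline.
#include <vector>

using Frame = std::vector<double>;                 // PCA coefficients of one sample frame

// Translate b_t so that its starting frame coincides with the ending frame of a_{t-1},
// producing b_t' (the translation transformation described above).
std::vector<Frame> alignToPrevious(const std::vector<Frame>& bt, const Frame& prevEnd) {
    std::vector<Frame> out = bt;
    for (auto& frame : out)
        for (size_t k = 0; k < frame.size(); ++k)
            frame[k] += prevEnd[k] - bt.front()[k];
    return out;
}

// Eq. 4: m_t[i] = (1 - (i-1)/4) * b_t'[i] + ((i-1)/4) * a_t[i], for i = 1..5.
std::vector<Frame> blendSegments(const std::vector<Frame>& bPrime, const std::vector<Frame>& at) {
    std::vector<Frame> m(5, Frame(at[0].size(), 0.0));
    for (int i = 0; i < 5; ++i) {
        double w = i / 4.0;                        // (i-1)/4 with 1-based i
        for (size_t k = 0; k < m[i].size(); ++k)
            m[i][k] = (1.0 - w) * bPrime[i][k] + w * at[i][k];
    }
    return m;
}

// Eq. 5: derivatives of the five sample frames; d_{t-1}[5] is carried over from the
// previous segment so that consecutive segments join smoothly.
std::vector<Frame> sampleDerivatives(const std::vector<Frame>& m, const Frame& dPrevLast) {
    std::vector<Frame> d(5, Frame(m[0].size(), 0.0));
    for (size_t k = 0; k < m[0].size(); ++k) {
        d[0][k] = dPrevLast[k];
        for (int i = 1; i < 4; ++i) d[i][k] = (m[i + 1][k] - m[i - 1][k]) / 2.0;
        d[4][k] = m[4][k] - m[3][k];
    }
    return d;
}

// Cubic Hermite interpolation between two sample frames (the HermiteInterpolation
// call in Algorithm 1), using the standard Hermite basis functions.
Frame hermite(const Frame& p0, const Frame& p1, const Frame& d0, const Frame& d1, double a) {
    double h00 = 2*a*a*a - 3*a*a + 1, h10 = a*a*a - 2*a*a + a;
    double h01 = -2*a*a*a + 3*a*a,    h11 = a*a*a - a*a;
    Frame out(p0.size(), 0.0);
    for (size_t k = 0; k < p0.size(); ++k)
        out[k] = h00*p0[k] + h10*d0[k] + h01*p1[k] + h11*d1[k];
    return out;
}

// Algorithm 1: produce N = duration(p_t) * FPS frames from m_t and d_t.
std::vector<Frame> interpolateSegment(const std::vector<Frame>& m, const std::vector<Frame>& d, int N) {
    std::vector<Frame> frames;
    for (int i = 0; i < N; ++i) {
        double pos = (N > 1) ? 4.0 * i / (N - 1) : 0.0;   // position among the 5 samples
        int lower = static_cast<int>(pos); if (lower > 3) lower = 3;
        double alpha = pos - lower;
        frames.push_back(hermite(m[lower], m[lower + 1], d[lower], d[lower + 1], alpha));
    }
    return frames;
}

Because the first derivative of each segment is copied from the previous one, consecutive segments share both position and slope at their junction, which is what yields the smooth transitions described above.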

5 RESULTS AND EVALUATIONS

We implemented our approach using C++, without GPU acceleration (Figure 8). The test computer used in all our experiments is an off-the-shelf machine with the following hardware configuration: an Intel Core i7 2.80GHz CPU and 3GB of memory. The real-time phoneme recognition system we use is the Julius system, trained on the open-source VoxForge acoustic model. In our system, the used 3D face model consists of 30K triangles, and a thin-shell deformation algorithm is used to deform the rest 3D face model based on the synthesized facial marker displacements. To improve the realism of the synthesized facial animation, we also synthesize corresponding head and eye movements simultaneously by implementing the live speech driven head-and-eye motion generator [14].

Fig. 8. A snapshot of our experiment scenario and several selected frames synthesized by our approach.

5.1 Quantitative Evaluation

In order to quantitatively evaluate the quality of our approach, we randomly selected three speech clips with associated motion from the captured dataset as the test data. We compute the Root Mean Squared Error (RMSE) between the synthesized motion and the captured motion (only taking the markers in the lower face into account, as illustrated in Figure 3). To better understand the accuracies of our approach under different situations, we synthesized motion in three cases, as shown in Table 1. Here, live speech phoneme recognition means that the phoneme sequence is sequentially outputted from Julius by simulating the pre-recorded speech as live speech input. Real-time facial motion synthesis means the facial motion is synthesized phoneme by phoneme, as described in this paper. The synthesis algorithm used in both Case 1 and Case 2 is the live speech driven lip-sync approach presented in this paper.

Approach | Live speech phoneme recog. | Real-time facial motion synthesis
Case 1   | -                          | Y
Case 2   | Y                          | Y
Baseline | -                          | Y

TABLE 1. Three cases/configurations of our lip-sync experiments.

Comparison with the baseline approach. We also compared our approach with a baseline approach [1]. Essentially, this viseme-based baseline approach adopts the classical Cohen-Massaro co-articulation model [5] to generate speech animation frames via interpolation. This method also needs users to manually specify (or design) visemes. To ensure a fair comparison between our approach and the chosen baseline approach, the visemes used in the implementation of [1] were automatically extracted as the center (mean) of the cluster that encloses all the representative values of the corresponding phonemes (refer to Section 3). The configuration of the baseline approach is also shown in Table 1.

The comparison results are shown in Table 2. Case 2 has the largest RMSE among the three test cases due to the inaccuracy of live speech phoneme recognition. The RMSE of Case 1 is smaller than that of the baseline approach in all three test cases; furthermore, we observe that the animation results in Case 1 are clearly better than those by the baseline approach in terms of motion smoothness and visual articulation.

Approach | Audio 1 | Audio 2 | Audio 3
Case 1   | 18.81   | 19.80   | 25.77
Case 2   | 24.16   | 25.79   | 27.81
Baseline | 22.34   | 22.40   | 26.17

TABLE 2. RMSE of the three test speech clips by different approaches.

5.2 Runtime Performance Analysis

In order to analyze the runtime efficiency of our approach, we used three pre-recorded speech clips, each of which is about 2 minutes in duration. We sequentially fed the speech clips into our system by simulating them as live speech input. During the live speech driven lip-sync process, the runtime performance of our approach was recorded simultaneously. Table 3 shows the breakdown of the per-phoneme computing time of our approach. From Table 3, we can observe the following:

• The real-time phoneme recognition step consumes about 60% of the total computing time of our approach on average. The average motion synthesis time of our approach is about 40% (around 40.85 ms) of the total computing time, which indicates that our lip-sync algorithm itself is highly efficient.

• We also calculated the average phoneme number per second in the test data (shown in the second rightmost column of Table 3) and the average FPS (shown in the rightmost column). The average FPS is computed as follows (a code sketch of this computation is given after Table 3):

AverageFPS = (∑_{i=1}^{N} NumberOfFrames(p_i)) / (∑_{i=1}^{N} TotalComputTime(p_i)),   (6)

where N is the total number of phonemes in the test data; NumberOfFrames(p_i) denotes the number of animation frames needed for visually representing p_i, which can simply be calculated as its duration (in seconds) multiplied by 30 frames (per second) in our experiment; and TotalComputTime(p_i) denotes the total computing time used by our approach for phoneme p_i, including the real-time phoneme recognition time and the motion synthesis time. As shown in Table 3, the calculated average FPS of our approach is around 27, which indicates that our approach is able to approximately achieve real-time speed on an off-the-shelf computer, without GPU acceleration.

Test audio number | Phoneme recog. time (ms) | Motion synthesis (ms) | Motion blending (ms) | Average total time (ms) | Average phoneme number per second | Average FPS
1 | 76.73 | 38.56 | 0.16 | 128.28 | 8.84 | 25.62
2 | 90.13 | 39.10 | 0.14 | 144.40 | 7.29 | 30.03
3 | 83.80 | 44.89 | 0.15 | 142.23 | 7.43 | 26.77

TABLE 3. Breakdown of the per-phoneme computing time of our approach.
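For illustration, the following minimal C++ sketch computes the average FPS of Eq. 6 from per-phoneme statistics, under the reading of Eq. 6 as total produced frames divided by total computing time. The PhonemeStat structure and averageFPS name are illustrative assumptions.

// Hedged sketch of the average-FPS metric (Eq. 6).
#include <vector>

struct PhonemeStat {
    double durationSec;      // duration of phoneme p_i
    double computeTimeSec;   // TotalComputTime(p_i): recognition + synthesis time
};

// Average FPS over a test clip: total frames produced divided by total computing time.
double averageFPS(const std::vector<PhonemeStat>& stats, double outputFPS = 30.0) {
    double frames = 0.0, time = 0.0;
    for (const auto& s : stats) {
        frames += s.durationSec * outputFPS;   // NumberOfFrames(p_i)
        time   += s.computeTimeSec;
    }
    return (time > 0.0) ? frames / time : 0.0;
}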

6 CONCLUSION AND DISCUSSION

In this paper, we introduce a practical phoneme-based approach to generating realistic speech animation in real-time based on live speech input. On the one hand, our approach is very simple and efficient, and can be straightforwardly implemented. On the other hand, it can work surprisingly well for most practical applications, achieving a good balance between animation realism and run-time efficiency, as demonstrated in our results. Because our approach is phoneme-based, it can naturally handle speech input from different speakers, assuming the employed state-of-the-art speech recognition engine can reasonably handle the speaker-independence issue. Despite its effectiveness and efficiency, our current approach has two limitations, described below.

• The quality of the visual speech animation synthesized by our approach substantially depends on the efficiency and accuracy of the used real-time speech (or phoneme) recognition engine. As described in our methodology, if the speech recognition engine cannot recognize phonemes in real-time or recognizes them with low accuracy, then such a delay or inaccuracy will be unavoidably propagated to, or even exaggerated in, the employed AnimPho selection strategy in our approach. On the other hand, with the continuous accuracy improvement of state-of-the-art real-time speech recognition techniques in the future, we anticipate that our approach will generate more realistic live speech driven lip-sync without modifications.

• Our current approach does not consider the affective state enclosed in live speech and thus cannot automatically synthesize corresponding facial expressions. One plausible solution to this problem would be to utilize state-of-the-art techniques to recognize, in real-time, changes in the affective state of a subject based on his/her live speech alone; such information could then be continuously fed into an advanced version of our current approach that dynamically incorporates the emotion factor into the expressive facial motion synthesis process.

As future work, besides extending the current approach to various mobile platforms, we also plan to conduct comprehensive user studies to evaluate its effectiveness and usability. We would also like to investigate better methods to increase the phoneme recognition accuracy from live speech and to faithfully retarget 3D facial mocap data to any static 3D face model.

ACKNOWLEDGEMENTS

This work was supported in part by NSF IIS-0914965, NIH 1R21HD075048-01A1, and the NSFC Overseas and Hong Kong/Macau Young Scholars Collaborative Research Award (project number: 61328204). The authors would like to thank Bingfeng Li and Lei Xie for helping with the Julius system and for numerous insightful technical discussions. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the agencies.

REFERENCES

[1] B. Li, L. Xie, X. Zhou, Z. Fu, and Y. Zhang, “Real-time speech driven talking avatar,” Journal of Tsinghua University (Science and Technology), vol. 51, no. 9, pp. 1180–1186, 2011.
[2] Y. Xu, A. W. Feng, S. Marsella, and A. Shapiro, “A practical and configurable lip sync method for games,” in Proceedings of Motion in Games (MIG) 2013. ACM, 2013, pp. 109–118.
[3] P. Hong, Z. Wen, and T. S. Huang, “Real-time speech-driven face animation with expressions using neural networks,” IEEE Transactions on Neural Networks, vol. 13, no. 4, pp. 916–927, 2002.
[4] R. Gutierrez-Osuna, P. Kakumanu, A. Esposito, O. Garcia, A. Bojorquez, J. L. Castillo, and I. Rudomin, “Speech-driven facial animation with realistic dynamics,” IEEE Transactions on Multimedia, vol. 7, no. 1, pp. 33–42, 2005.
[5] M. M. Cohen and D. W. Massaro, “Modeling coarticulation in synthetic visual speech,” in Models and Techniques in Computer Animation. Springer, 1993, pp. 139–156.
[6] A. Wang, M. Emmi, and P. Faloutsos, “Assembling an expressive facial animation system,” in Proceedings of the 2007 ACM SIGGRAPH Symposium on Video Games. ACM, 2007, pp. 21–26.
[7] C. Bregler, M. Covell, and M. Slaney, “Video rewrite: Driving visual speech with audio,” in Proceedings of SIGGRAPH ’97. ACM, 1997, pp. 353–360.
[8] Y. Cao, W. C. Tien, P. Faloutsos, and F. Pighin, “Expressive speech-driven facial animation,” ACM Transactions on Graphics, vol. 24, no. 4, pp. 1283–1302, 2005.
[9] Z. Deng and U. Neumann, “eFASE: Expressive facial animation synthesis and editing with phoneme-level controls,” in Proc. of ACM SIGGRAPH/Eurographics Symposium on Computer Animation. Eurographics Association, 2006, pp. 251–259.
[10] S. L. Taylor, M. Mahler, B.-J. Theobald, and I. Matthews, “Dynamic units of visual speech,” in Proceedings of the 11th ACM SIGGRAPH/Eurographics Conference on Computer Animation. Eurographics Association, 2012, pp. 275–284.
[11] X. Ma and Z. Deng, “A statistical quality model for data-driven speech animation,” IEEE Transactions on Visualization and Computer Graphics, vol. 18, no. 11, pp. 1915–1927, 2012.
[12] M. Brand, “Voice puppetry,” in Proceedings of SIGGRAPH ’99. ACM, 1999, pp. 21–28.
[13] T. Ezzat, G. Geiger, and T. Poggio, “Trainable videorealistic speech animation,” ACM Transactions on Graphics, vol. 21, no. 3, pp. 388–398, 2002.

[14] B. Le, X. Ma, and Z. Deng, “Live speech driven head-and-eye motion generators,” IEEE Transactions on Visualization and Computer Graphics, vol. 18, no. 11, pp. 1902–1914, 2012.

AUTHORS' BIOGRAPHIES

Li Wei is a PhD student in the Department of Computer Science at the University of Houston. His research interests include computer graphics and animation. He received his B.S. and M.S. in Automation from Xiamen University, China, in 2009 and 2012, respectively.

Zhigang Deng is an Associate Professor of Computer Science at the University of Houston. His research interests include computer graphics, computer animation, and human-computer interaction. He earned his Ph.D. from the University of Southern California in 2006, his M.S. from Peking University in 2000, and his B.S. from Xiamen University in 1997.

