Abstract—In this article we introduce a simple, efficient, yet practical phoneme-based approach to generate realistic speech animation in real time based on live speech input. Specifically, we first decompose lower-face movements into low-dimensional principal component spaces. Then, in each of the retained principal component spaces, we select the AnimPho with the highest priority value and the minimum smoothness energy. Finally, we apply motion blending and interpolation techniques to compute the final animation frames for the currently inputted phoneme. Through many experiments and comparisons, we demonstrate the realism of the speech animation synthesized by our approach as well as its real-time efficiency on an off-the-shelf computer.

Index Terms—Facial animation, speech animation, live speech driven, talking avatars, virtual humans, data-driven

• L. Wei and Z. Deng are with the Computer Graphics and Interactive Media Lab and the Department of Computer Science, University of Houston, Houston, TX 77204-3010.
• E-mail: zdeng4@uh.edu.

1 INTRODUCTION

In the signal processing and speech understanding communities, several approaches have been proposed to generate speech animation based on live acoustic speech input. For example, based on a real-time recognized phoneme sequence, researchers use simple linear smoothing functions to produce corresponding speech animation [1], [2]. Meanwhile, a number of approaches train statistical models (e.g., neural networks) to encode the mapping between acoustic speech features and facial movements [3], [4]. These approaches demonstrated their real-time runtime efficiency on an off-the-shelf computer; however, their performance is highly speaker-dependent due to the individual-specific nature of the chosen acoustic speech features. Furthermore, due to their insufficient visual realism, these approaches are practically less suitable for graphics and animation applications.

Challenges of live speech driven lip-sync. First, live speech driven lip-sync imposes additional technical challenges beyond the off-line case, where expensive global optimization techniques can be employed to solve for the most plausible speech motion corresponding to novel spoken or typed input. In contrast, it is extremely difficult, if not impossible, to directly apply such global optimization techniques to live speech driven lip-sync applications, since the forthcoming (not yet available) speech content cannot be exploited during the synthesis process. Second, live speech driven lip-sync algorithms need to be highly efficient to ensure a real-time speed of generating speech animation on an off-the-shelf computer, while off-line speech animation synthesis algorithms do not need to meet such a tight constraint. The last challenge, compared with the case of forced phoneme alignment for pre-recorded speech, comes from the low accuracy of state-of-the-art live speech phoneme recognition systems (e.g., the Julius system (http://julius.sourceforge.jp) or the HTK toolkit (http://htk.eng.cam.ac.uk)).

In order to quantify the phoneme recognition accuracy difference between the pre-recorded speech and live speech cases, we performed an empirical study as follows. We randomly selected 10 pre-recorded sentences and extracted their phoneme sequences using the following two different approaches: (1) the Julius system was used to do forced phoneme alignment on the 10 pre-recorded speech clips (called offline phoneme alignment), and (2) the same Julius system was used as a real-time phoneme recognition engine; in other words, by simulating the same pre-recorded speech clip as live speech, it outputted phonemes sequentially while the speech was being fed into the system. Then, by taking the offline phoneme alignment results as the ground truth, we can compute the accuracies of the live speech phoneme recognition in our experiment. As illustrated in Figure 1, the live speech phoneme recognition accuracy of the same Julius system varies from 45% to 80%. Further empirical analysis did not show any patterns among the incorrectly recognized phonemes (that is, which phonemes are often recognized incorrectly in the live speech case). This empirical finding implies that, in order to produce satisfactory live speech driven speech animation results, any phoneme-based algorithm must take the relatively low phoneme recognition accuracy (in the case of live speech) into design consideration, and it should be able to perform certain self-correction at runtime, since some phonemes could be incorrectly recognized and inputted into the algorithm in a less predictable manner.

Inspired by the above research challenges, in this paper we propose a practical phoneme-based approach for live speech driven lip-sync. Besides generating realistic speech animation in real time, our phoneme-based approach can straightforwardly handle speech input from different speakers, which is one of the major advantages of phoneme-based approaches over acoustic speech feature driven approaches [3], [4]. Specifically, we introduce an efficient, simple algorithm to compute
Clustering: We apply the K-means clustering algorithm to the motion vector dataset {v_i : i = 1...n}, where n is the total number of motion vectors. Euclidean distance is used to compute the distance between two motion vectors. The center of each obtained motion cluster is called an AnimPho in this writing. Finally, we obtain a set of AnimPhos {a_i : i = 1...m}, where m is the number of AnimPhos and a_i is the i-th AnimPho.

After the above clustering step, we essentially convert the motion dataset to a set of AnimPho sequences. Then, we further pre-compute the following two data structures:

• AnimPho transition table, T. It is an m × m table, where m is the number of AnimPhos. T(a_i, a_j) = 1 if the AnimPho transition <a_i, a_j> exists in the motion dataset; otherwise T(a_i, a_j) = 0. In the remainder of this paper, we say two AnimPhos a_i and a_j are connected if and only if T(a_i, a_j) = 1. We also define T(a_i) = {a_j | T(a_i, a_j) = 1} as the set of AnimPhos that are connected to a_i.

• Phoneme-AnimPho mapping table, L. It is an h × m table, where h is the number of used phonemes (43 English phonemes are used in this work). L(p_i, a_j) = 1 if the phoneme label of a_j is phoneme p_i in the dataset; otherwise L(p_i, a_j) = 0. Similarly, we define L(p_i) = {a_j | L(p_i, a_j) = 1} as the set of AnimPhos that have the phoneme label p_i.

φ(p_t) is empty (called a miss in this paper), then we reassign all the AnimPhos (regardless of their phoneme labels) as φ(p_t).

a_t = argmin_{η ∈ φ(p_t)} dist(E(a_{t−1}), B(η)),    (1)

where dist(E(a_{t−1}), B(η)) returns the Euclidean distance between the ending frame of a_{t−1} and the starting frame of η. If L(p_t) ∧ T(a_{t−1}) is not empty, this method can ensure the co-articulation by finding the AnimPho a_t that is naturally connected to a_{t−1} in the captured dataset. However, it does not work well in practical applications. The main reason is that the missing rate of this method is typically very high due to the limited size of the captured dataset; as a result, the synthesized animations are often incorrectly articulated and over-smoothed (i.e., only minimizing the smoothness energy defined in Eq. 1).

[Figure: missing rate comparison between the naive selection method and our method.]
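To make the two pre-computed tables and the naive selection rule of Eq. 1 concrete, the following C++ sketch shows one possible in-memory representation. It is only an illustration under our own assumptions: the type and function names (AnimPho, AnimPhoDatabase, selectNaive), the dense std::vector storage, and the five-frame segment layout carried over from Section 3 are ours, not the paper's implementation.

#include <array>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// One AnimPho: the center of a motion-segment cluster, represented by
// 5 evenly sampled frames of facial-motion parameters (see Section 3).
struct AnimPho {
    int phonemeId;                              // phoneme label (0..42)
    std::array<std::vector<double>, 5> frames;  // the 5 sample frames
};

struct AnimPhoDatabase {
    std::vector<AnimPho> animphos;        // {a_1 ... a_m}
    std::vector<std::vector<char>> T;     // m x m transition table: T[i][j] = 1 if <a_i, a_j> observed
    std::vector<std::vector<int>> L;      // phoneme -> indices of AnimPhos carrying that label
};

// Euclidean distance between two frames.
static double frameDist(const std::vector<double>& x, const std::vector<double>& y) {
    double s = 0.0;
    for (std::size_t k = 0; k < x.size(); ++k) s += (x[k] - y[k]) * (x[k] - y[k]);
    return std::sqrt(s);
}

// Naive selection (Eq. 1): among the candidate set phi(p_t), pick the AnimPho whose
// starting frame B(eta) is closest to the ending frame E(a_{t-1}) of the previous AnimPho.
int selectNaive(const AnimPhoDatabase& db, int prev /* a_{t-1} */, int pt /* current phoneme */) {
    // phi(p_t): AnimPhos labeled p_t that are connected to a_{t-1}.
    std::vector<int> phi;
    for (int j : db.L[pt])
        if (db.T[prev][j]) phi.push_back(j);
    // A "miss": fall back to all AnimPhos, regardless of their phoneme labels.
    if (phi.empty())
        for (int j = 0; j < static_cast<int>(db.animphos.size()); ++j) phi.push_back(j);

    const std::vector<double>& prevEnd = db.animphos[prev].frames[4];   // E(a_{t-1})
    int best = phi.front();
    double bestDist = std::numeric_limits<double>::max();
    for (int j : phi) {
        const double d = frameDist(prevEnd, db.animphos[j].frames[0]);  // B(eta)
        if (d < bestDist) { bestDist = d; best = j; }
    }
    return best;  // a_t
}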
sequence corresponding to the observed phoneme sequence (up to p_t). The priority value v_i of a_i is calculated as follows:

v_i = max_{a_j} (v_j + 1),   if ∃ a_j ∈ L(p_{t−1}) with a_i ∈ T(a_j);
v_i = 1,                     otherwise.    (2)

Then, we select AnimPho a_t by taking both the priority value and the smoothness term into consideration. In our

Eq. 3 is guaranteed to return b_t = a_t.
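As a rough illustration of the priority values in Eq. 2, the following C++ sketch computes v_i for every AnimPho; it reuses the AnimPhoDatabase structure from the earlier sketch, and the function and variable names are ours, not the authors' code.

#include <algorithm>
#include <vector>

// Priority values of Eq. 2. prevPriority[j] holds v_j computed for the previous
// phoneme p_{t-1}; the function returns v_i for every AnimPho a_i.
// AnimPhoDatabase is the structure from the previous sketch.
std::vector<int> computePriorities(const AnimPhoDatabase& db,
                                   const std::vector<int>& prevPriority,
                                   int ptMinus1 /* previous phoneme p_{t-1} */) {
    const int m = static_cast<int>(db.animphos.size());
    std::vector<int> v(m, 1);                    // the "otherwise" branch of Eq. 2
    for (int j : db.L[ptMinus1]) {               // a_j labeled with p_{t-1}
        for (int i = 0; i < m; ++i) {
            if (db.T[j][i])                      // a_i in T(a_j): transition <a_j, a_i> exists
                v[i] = std::max(v[i], prevPriority[j] + 1);
        }
    }
    return v;  // a larger v_i means a_i extends a longer naturally connected AnimPho path
}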
Generation of an intermediate motion segment, m_t. After b_t is identified, we first need to modify b_t to ensure the motion continuity between the previous animation segment, a_{t−1}, and the animation for p_t that we are going to synthesize. In this writing, we use b_t' to denote the modified version of b_t. Then, we need to generate an intermediate motion segment m_t by linearly blending b_t' and a_t.

All the AnimPhos (including b_t, a_{t−1}, and a_t) store the centers of the corresponding motion segment clusters, as described in Section 3; they do not retain the original motion trajectories in the dataset. As a result, the starting frame of b_t and the ending frame of a_{t−1} may be slightly mismatched although the AnimPho transition <a_{t−1}, b_t> indeed exists in the captured motion dataset (refer to the top panel of Figure 7). To make the starting frame of b_t perfectly match the ending frame of a_{t−1}, we need to adjust b_t (via a translation transformation) by setting the starting frame of b_t equal to the ending frame of a_{t−1} (refer to the middle panel of Figure 7).

Then, we linearly blend the two motion segments, b_t' and a_t, to obtain an intermediate motion segment m_t, as illustrated in the bottom panel of Figure 7. As described in Section 3, we use 5 evenly sampled frames to represent a motion segment. Therefore, this linear blending can be described using the following Equation 4:

m_t[i] = (1 − (i−1)/4) × b_t'[i] + ((i−1)/4) × a_t[i],    (1 ≤ i ≤ 5),    (4)

where m_t[i] is the i-th frame of the motion segment m_t.

Fig. 7. Illustration of the motion blending and interpolation process. Top: After the AnimPho selection step, we can see that neither a_t nor b_t has a smooth transition from the previous animation segment (denoted as the black solid line). Middle: We set the starting frame of b_t equal to the ending frame of the previous animation segment (for phoneme p_{t−1}). Bottom: We linearly blend a_t and b_t' to obtain a new motion segment m_t and further compute the derivatives of m_t. Finally, we apply Hermite interpolation to obtain the in-between frames.
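A minimal C++ sketch of these two steps follows, assuming (as the text suggests) that the translation used to align b_t is simply applied to every one of its five sample frames; the names Frame, Segment, alignToPrevious, and blendSegments are our own illustrative choices, not the authors' implementation.

#include <array>
#include <cstddef>
#include <vector>

using Frame = std::vector<double>;     // one frame of facial-motion parameters
using Segment = std::array<Frame, 5>;  // 5 evenly sampled frames per motion segment

// Step 1: translate b_t so that its starting frame coincides with the ending frame
// of a_{t-1}, producing b_t' (middle panel of Figure 7).
Segment alignToPrevious(const Segment& bt, const Frame& prevEnd) {
    Segment btPrime = bt;
    for (int i = 0; i < 5; ++i)
        for (std::size_t k = 0; k < prevEnd.size(); ++k)
            btPrime[i][k] += prevEnd[k] - bt[0][k];   // same offset applied to every frame
    return btPrime;
}

// Step 2: the linear blend of Eq. 4, written with 0-based frame indices,
// m_t[i] = (1 - i/4) * b_t'[i] + (i/4) * a_t[i].
Segment blendSegments(const Segment& btPrime, const Segment& at) {
    Segment mt;
    for (int i = 0; i < 5; ++i) {
        const double w = i / 4.0;                     // (i-1)/4 in the paper's 1-based indexing
        mt[i].resize(btPrime[i].size());
        for (std::size_t k = 0; k < btPrime[i].size(); ++k)
            mt[i][k] = (1.0 - w) * btPrime[i][k] + w * at[i][k];
    }
    return mt;
}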
Motion interpolation. Taking the down-sampled motion segment m_t as input, we use an interpolation method to compute the final animation frames for phoneme p_t based on its duration. We calculate the required number of frames N in the final animation as N = duration(p_t) × FPS, where duration(p_t) denotes the duration of phoneme p_t, and FPS is the number of frames per second of the resulting animation (e.g., 30 FPS).

In this work, we use Hermite interpolation to compute the in-between frames between two sample frames. The derivatives of the five sample frames are computed as follows:

d_t[i] = d_{t−1}[5],                    if i = 1;
d_t[i] = (m_t[i+1] − m_t[i−1]) / 2,     if 1 < i < 5;    (5)
d_t[i] = m_t[i] − m_t[i−1],             if i = 5,

where d_{t−1} denotes the derivatives of the previous down-sampled motion segment. As illustrated in the bottom panel of Figure 7, Hermite interpolation can guarantee the C2 smoothness of the resulting animation. Finally, we use m_t and d_t to interpolate the in-between animation frames. The motion interpolation procedure can be described using the pseudocode in Algorithm 1, in which the function p = HermitInterpolation(upper, lower, α) takes the upper and lower sample frames as input and computes the in-between frame corresponding to α ∈ [0, 1].

Algorithm 1 Motion Interpolation
Input: Sample frames, m_t; the derivatives of the sample frames, d_t; the target number of frames, N.
Output: The interpolated sequence, S_t.
1: for i = 1 → N do
2:   Find the upper and lower sample frames: [lower, upper] = Bound(i)
3:   α = (i − lower) / (upper − lower)
4:   S_t[i] = HermitInterpolation(upper, lower, α)
5: end for
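The following C++ sketch combines Eq. 5 with Algorithm 1. It is an illustration only: the paper does not spell out Bound(i), so we assume the five sample frames are spread evenly over the N output frames, and we use the standard cubic Hermite basis for HermitInterpolation; the Frame and Segment types match the earlier sketches.

#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

using Frame = std::vector<double>;
using Segment = std::array<Frame, 5>;

// Derivatives of the 5 sample frames (Eq. 5); dPrevLast is d_{t-1}[5], the last
// derivative of the previous down-sampled motion segment.
Segment computeDerivatives(const Segment& mt, const Frame& dPrevLast) {
    Segment dt;
    const std::size_t dim = mt[0].size();
    for (int i = 0; i < 5; ++i) dt[i].assign(dim, 0.0);
    dt[0] = dPrevLast;                                        // i = 1 in the paper's indexing
    for (int i = 1; i < 4; ++i)                               // 1 < i < 5: central differences
        for (std::size_t k = 0; k < dim; ++k)
            dt[i][k] = (mt[i + 1][k] - mt[i - 1][k]) / 2.0;
    for (std::size_t k = 0; k < dim; ++k)                     // i = 5: backward difference
        dt[4][k] = mt[4][k] - mt[3][k];
    return dt;
}

// Cubic Hermite interpolation between sample frames p0 and p1 with derivatives d0, d1.
Frame hermite(const Frame& p0, const Frame& d0, const Frame& p1, const Frame& d1, double a) {
    const double h00 = 2*a*a*a - 3*a*a + 1.0, h10 = a*a*a - 2*a*a + a;
    const double h01 = -2*a*a*a + 3*a*a,      h11 = a*a*a - a*a;
    Frame out(p0.size());
    for (std::size_t k = 0; k < p0.size(); ++k)
        out[k] = h00 * p0[k] + h10 * d0[k] + h01 * p1[k] + h11 * d1[k];
    return out;
}

// Algorithm 1, assuming the 5 sample frames are placed evenly across the N output frames.
std::vector<Frame> interpolateSegment(const Segment& mt, const Segment& dt, int N) {
    std::vector<Frame> St(N);
    for (int i = 0; i < N; ++i) {
        const double pos = (N > 1) ? 4.0 * i / (N - 1) : 0.0;        // position in sample-frame units
        const int lower = std::min(3, static_cast<int>(std::floor(pos)));
        const int upper = lower + 1;                                 // Bound(i): bracketing samples
        const double alpha = pos - lower;                            // alpha in [0, 1]
        St[i] = hermite(mt[lower], dt[lower], mt[upper], dt[upper], alpha);
    }
    return St;
}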
5 RESULTS AND EVALUATIONS

We implemented our approach using C++, without GPU acceleration (Figure 8). The test computer used in all our experiments is an off-the-shelf computer with the following hardware configuration: an Intel Core i7 2.80GHz CPU and 3GB of memory. The used real-time phoneme recognition system is the Julius system, trained on the open-source VoxForge acoustic model. In our system, the used 3D face model consists of 30K triangles, and a thin-shell deformation algorithm is used to deform the rest 3D face model based
on synthesized facial marker displacements. To improve the realism of the synthesized facial animation, we also synthesize corresponding head and eye movements simultaneously by implementing the live speech driven head-and-eye motion generator [14].

to manually specify (or design) visemes. To ensure a fair comparison between our approach and the chosen baseline approach, the visemes used in the implementation of [1] were automatically extracted as the center (mean) of the cluster that encloses all the representative values of the corresponding phonemes (refer to Section 3). The configuration of the baseline approach is also shown in Table 1.

The comparison results are shown in Table 2. Case 2 has the largest RMSE errors in the three test cases due to the inaccuracy of live speech phoneme recognition. The RMSE error of Case 1 is smaller than that of the baseline approach in all three test cases; moreover, we observe that the animation results in Case 1 are clearly better than those by the baseline approach in terms of motion smoothness and visual articulation.

TABLE 2
RMSE of the three test speech clips by different approaches

Approach    Audio 1    Audio 2    Audio 3
Case 1      18.81      19.80      25.77
Case 2      24.16      25.79      27.81
Baseline    22.34      22.40      26.17

TABLE 3
Breakdown of the per-phoneme computing time of our approach

approach is able to approximately achieve the real-time speed on an off-the-shelf computer, without GPU acceleration.
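For completeness, here is a hedged C++ sketch of how the reported RMSE scores could be computed; this excerpt does not spell out the exact per-frame error definition, so the accumulation over all frames and all marker coordinates below is our assumption rather than the authors' evaluation code.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Root-mean-square error between a synthesized and a ground-truth marker trajectory,
// accumulated over all frames and all marker coordinates (illustrative only).
double rmse(const std::vector<std::vector<double>>& synthesized,
            const std::vector<std::vector<double>>& groundTruth) {
    double sum = 0.0;
    std::size_t count = 0;
    const std::size_t frames = std::min(synthesized.size(), groundTruth.size());
    for (std::size_t f = 0; f < frames; ++f)
        for (std::size_t k = 0; k < synthesized[f].size(); ++k) {
            const double e = synthesized[f][k] - groundTruth[f][k];
            sum += e * e;
            ++count;
        }
    return count ? std::sqrt(sum / count) : 0.0;
}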
6 CONCLUSION AND DISCUSSION

In this paper, we introduce a practical phoneme-based approach to generate realistic speech animation in real time based on live speech input. On the one hand, our approach is very simple and efficient, and can be straightforwardly implemented. On the other hand, it can work surprisingly well for most practical applications, achieving a good balance between animation realism and run-time efficiency, as demonstrated in our results. Because our approach is phoneme-based, it can naturally handle speech input from different speakers, assuming the employed state-of-the-art speech recognition engine can reasonably handle the speaker-independence issue. Despite its effectiveness and efficiency, our current approach has two limitations, described below.

• The quality of visual speech animation synthesized by our approach substantially depends on the efficiency and accuracy of the used real-time speech (or phoneme) recognition engine. As described in our methodology, if the speech recognition engine cannot recognize phonemes in real time or does so with low accuracy, then such a delay or inaccuracy will be unavoidably propagated to, or even exaggerated in, the employed AnimPho selection strategy in our approach. On the other hand, with the continuous accuracy improvement of state-of-the-art real-time speech recognition techniques in the future, we anticipate that our approach can generate more realistic live speech driven lip-sync without modifications.

• Our current approach does not consider the affective state enclosed in live speech and thus cannot automatically synthesize corresponding facial expressions. One plausible solution to this problem would be to utilize state-of-the-art techniques to recognize, in real time, the change of the affective state of a subject based on his/her live speech alone; such information could then be continuously fed into an advanced version of our current approach that can dynamically incorporate the emotion factor into the expressive facial motion synthesis process.

As future work, besides extending the current approach to various mobile platforms, we also plan to conduct comprehensive user studies to evaluate its effectiveness and usability. Also, we would like to investigate better methods to increase the phoneme recognition accuracy from live speech and to faithfully retarget 3D facial mocap data to any static 3D face models.

ACKNOWLEDGEMENTS

This work was supported in part by NSF IIS-0914965, NIH 1R21HD075048-01A1, and an NSFC Overseas and Hong Kong/Macau Young Scholars Collaborative Research Award (project number: 61328204). The authors would like to thank Bingfeng Li and Lei Xie for helping with the Julius system and for numerous insightful technical discussions. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the agencies.

REFERENCES

[1] B. Li, L. Xie, X. Zhou, Z. Fu, and Y. Zhang, "Real-time speech driven talking avatar," Journal of Tsinghua University (Science and Technology), vol. 51, no. 9, pp. 1180–1186, 2011.
[2] Y. Xu, A. W. Feng, S. Marsella, and A. Shapiro, "A practical and configurable lip sync method for games," in Proceedings of Motion in Games (MIG) 2013. ACM, 2013, pp. 109–118.
[3] P. Hong, Z. Wen, and T. S. Huang, "Real-time speech-driven face animation with expressions using neural networks," IEEE Transactions on Neural Networks, vol. 13, no. 4, pp. 916–927, 2002.
[4] R. Gutierrez-Osuna, P. Kakumanu, A. Esposito, O. Garcia, A. Bojorquez, J. L. Castillo, and I. Rudomin, "Speech-driven facial animation with realistic dynamics," IEEE Transactions on Multimedia, vol. 7, no. 1, pp. 33–42, 2005.
[5] M. M. Cohen and D. W. Massaro, "Modeling coarticulation in synthetic visual speech," in Models and Techniques in Computer Animation. Springer, 1993, pp. 139–156.
[6] A. Wang, M. Emmi, and P. Faloutsos, "Assembling an expressive facial animation system," in Proceedings of the 2007 ACM SIGGRAPH Symposium on Video Games. ACM, 2007, pp. 21–26.
[7] C. Bregler, M. Covell, and M. Slaney, "Video rewrite: Driving visual speech with audio," in Proceedings of SIGGRAPH '97. ACM, 1997, pp. 353–360.
[8] Y. Cao, W. C. Tien, P. Faloutsos, and F. Pighin, "Expressive speech-driven facial animation," ACM Transactions on Graphics (TOG), vol. 24, no. 4, pp. 1283–1302, 2005.
[9] Z. Deng and U. Neumann, "eFASE: Expressive facial animation synthesis and editing with phoneme-level controls," in Proc. of ACM SIGGRAPH/Eurographics Symposium on Computer Animation. Eurographics Association, 2006, pp. 251–259.
[10] S. L. Taylor, M. Mahler, B.-J. Theobald, and I. Matthews, "Dynamic units of visual speech," in Proceedings of the 11th ACM SIGGRAPH/Eurographics Conference on Computer Animation. Eurographics Association, 2012, pp. 275–284.
[11] X. Ma and Z. Deng, "A statistical quality model for data-driven speech animation," IEEE Transactions on Visualization and Computer Graphics, vol. 18, no. 11, pp. 1915–1927, 2012.
[12] M. Brand, "Voice puppetry," in Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '99). ACM, 1999, pp. 21–28.
[13] T. Ezzat, G. Geiger, and T. Poggio, "Trainable videorealistic speech animation," ACM Trans. Graph., vol. 21, no. 3, pp. 388–398, Jul. 2002.
[14] B. Le, X. Ma, and Z. Deng, "Live speech driven head-and-eye motion generators," IEEE Transactions on Visualization and Computer Graphics, vol. 18, no. 11, pp. 1902–1914, 2012.
AUTHORS' BIOGRAPHIES

Li Wei is a PhD student in the Department of Computer Science at the University of Houston. His research interests include computer graphics and animation. He received his B.S. and M.S. in Automation from Xiamen University, China, in 2009 and 2012, respectively.