The training of high-quality, robust machine learning models for speech-driven 3D facial animation requires a large, diverse dataset of high-quality audio-animation pairs. To overcome the lack of such a dataset, recent work has introduced large pre-trained speech encoders that are robust to variations in the input audio and, therefore, enable the facial animation model to generalize across speakers, audio quality, and languages. However, the resulting facial animation models are prohibitively large and lend themselves only to offline inference on a dedicated machine. In this work, we explore on-device, real-time facial animation models in the context of game development. We overcome the lack of large datasets by using hybrid knowledge distillation with pseudo-labeling. Given a large audio dataset, we employ a high-performing teacher model to train very small student models. In contrast to the pre-trained speech encoders, our student models only consist of convolutional and fully-connected layers, removing the need for attention context or recurrent updates. In our experiments, we demonstrate that we can reduce the memory footprint down to 3.4 MB and the required future audio context down to 81 ms while maintaining high-quality animations. This paves the way for on-device inference, an important step towards realistic, model-driven digital characters.
Our knowledge distillation framework operates in two stages: heterogeneous distillation and hybrid distillation. In the heterogeneous stage, we distill a large transformer-based facial animation model into a compact model composed solely of convolutional and fully-connected layers. In the hybrid stage, the previously trained compact model is frozen and used as a second teacher. A tiny model and a low-latency model, both sharing a convolutional architecture similar to that of the compact model, are supervised by these two teachers.
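To make the compact model concrete, below is a minimal PyTorch sketch of such a convolution-plus-fully-connected student. The layer count, channel width, kernel sizes, and number of rig parameters are illustrative assumptions, not the configuration used in our experiments.

```python
# A minimal sketch of a compact student: only 1-D convolutions and
# fully-connected layers, no attention or recurrence. All sizes are assumptions.
import torch
import torch.nn as nn

class CompactStudent(nn.Module):
    def __init__(self, num_rig_params: int = 100, channels: int = 256):
        super().__init__()
        # Strided convolutions downsample the raw waveform window.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=4, stride=2), nn.ReLU(),
        )
        # Fully-connected head maps pooled features to one frame of rig parameters.
        self.head = nn.Sequential(
            nn.Linear(channels, channels), nn.ReLU(),
            nn.Linear(channels, num_rig_params),
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, 1, samples), one audio window per animation frame.
        features = self.encoder(waveform)   # (batch, channels, time)
        pooled = features.mean(dim=-1)      # temporal average pooling
        return self.head(pooled)            # (batch, num_rig_params)
```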
We use the Voice2Face model as an example teacher. It consists of three components: a HuBERT encoder, a mesh generator, and a Mesh2Rig model. It processes the entire input sentence (usually > 5 s) to produce rig parameters for each frame. In contrast, the small student model takes raw audio waveforms directly as input and generates rig parameters frame by frame. The input for each frame is a 512 ms window, spanning 256 ms into the past and 256 ms into the future.
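The per-frame input can be sliced from the waveform as sketched below; the 16 kHz sample rate and zero-padding at clip boundaries are assumptions for illustration.

```python
# Extract a 512 ms window per frame: 256 ms of past and 256 ms of future audio.
import numpy as np

SAMPLE_RATE = 16_000
WINDOW_MS = 512
HALF = int(SAMPLE_RATE * WINDOW_MS / 1000) // 2  # 256 ms on each side

def frame_window(audio: np.ndarray, frame_time_s: float) -> np.ndarray:
    """Return the 512 ms window centered on frame_time_s, zero-padded at the edges."""
    center = int(round(frame_time_s * SAMPLE_RATE))
    start, end = center - HALF, center + HALF
    window = np.zeros(2 * HALF, dtype=audio.dtype)
    src_start, src_end = max(start, 0), min(end, len(audio))
    window[src_start - start : src_end - start] = audio[src_start:src_end]
    return window

# Example: windows for a 30 FPS animation track over a 5 s clip (placeholder audio).
audio = np.zeros(SAMPLE_RATE * 5, dtype=np.float32)
windows = [frame_window(audio, t / 30.0) for t in range(30 * 5)]
```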
Dataset: LibriSpeech (960 hours)
The significant difference between the teacher and the small student makes an intermediate feature loss impractical. Hence, we only apply a label loss between the predicted rig parameters and the pseudo labels generated by the teacher.
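A minimal sketch of this pseudo-labeling setup is shown below: the frozen teacher produces per-frame rig parameters for unlabeled audio, and the student is trained to match them. Using MSE as the distance on rig parameters is an assumption here.

```python
# Pseudo-label distillation: teacher labels unlabeled audio, student regresses them.
import torch
import torch.nn.functional as F

@torch.no_grad()
def make_pseudo_labels(teacher, sentence_audio):
    # The teacher processes the whole sentence and returns per-frame rig parameters.
    return teacher(sentence_audio)

def label_loss(student, frame_windows, pseudo_labels):
    # frame_windows: (frames, 1, samples); pseudo_labels: (frames, num_rig_params)
    predictions = student(frame_windows)
    return F.mse_loss(predictions, pseudo_labels)  # MSE is an assumed choice
```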
We design two variants based on the small student model. To enable on-device deployment with low memory usage, we reduce the number of channels by 75%. For a real-time version, we reduce the future context in the input window while keeping the total window length unchanged.
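The two variants can be summarized as configurations relative to the compact student, as sketched below. Only the relative changes come from the text; the 81 ms future context is taken from the abstract, the 431 ms past context follows from the fixed 512 ms window, and the absolute channel counts are assumptions.

```python
# Hypothetical configurations; absolute channel counts are illustrative only.
COMPACT     = dict(channels=256, past_ms=256, future_ms=256)
TINY        = dict(channels=64,  past_ms=256, future_ms=256)   # 75% fewer channels
LOW_LATENCY = dict(channels=256, past_ms=431, future_ms=81)    # shorter future context
```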
We could train these two variants in the same way as the small student. However, since the small model and its two variants share similar architectures, we apply a homogeneous feature loss between them in addition to the label loss described above.
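A hedged sketch of this hybrid objective is given below: a label loss against the original teacher's pseudo labels plus a homogeneous feature loss against the frozen compact student. The feature hook (`return_features=True`), the projection layer that matches feature widths, and the weighting `lam` are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(variant, compact_teacher, proj, frame_windows, pseudo_labels, lam=1.0):
    """Label loss from the teacher's pseudo labels + feature loss from the compact teacher."""
    pred, feat = variant(frame_windows, return_features=True)            # assumed API
    with torch.no_grad():
        _, teacher_feat = compact_teacher(frame_windows, return_features=True)
    label_term = F.mse_loss(pred, pseudo_labels)
    # proj maps the variant's (possibly narrower) features to the teacher's width.
    feature_term = F.mse_loss(proj(feat), teacher_feat)
    return label_term + lam * feature_term
```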
We observed significant jitter in the real-time variant. To reduce it, we design an ensemble prediction method tailored to real-time systems. For each frame at time \( t \), the smoothed rig parameters are given by a weighted sum: \[ \hat{\mathbf{r}}_t^{\text{smooth}} = \alpha_1\hat{\mathbf{r}}_{t-16.7\,\text{ms}}+\alpha_2\hat{\mathbf{r}}_{t}+\alpha_3\hat{\mathbf{r}}_{t+16.7\,\text{ms}}, \] where we choose \(\alpha_1=\alpha_2=\alpha_3=\frac{1}{3}\). This averages rig predictions from three consecutive frames generated at 60 FPS, while the animation can still be rendered at 30 FPS. Notably, the smoothed prediction adds 16.7 ms of latency. Because inference is run twice as often, memory consumption increases by roughly a factor of two, while peak memory consumption stays constant.
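A minimal sketch of this smoothing is shown below: the model runs at 60 FPS and three consecutive predictions are averaged for each output frame. The buffering logic is an assumption for illustration.

```python
# Ensemble smoothing: average the predictions at t-16.7 ms, t, and t+16.7 ms.
from collections import deque
import numpy as np

class EnsembleSmoother:
    def __init__(self, weights=(1/3, 1/3, 1/3)):
        self.weights = np.asarray(weights, dtype=np.float32)
        self.buffer = deque(maxlen=3)  # three consecutive 60 FPS predictions

    def push(self, rig_params: np.ndarray):
        """Feed one 60 FPS prediction; returns a smoothed frame once 3 are buffered.
        For 30 FPS rendering, keep every other smoothed frame."""
        self.buffer.append(rig_params)
        if len(self.buffer) < 3:
            return None
        return sum(w * r for w, r in zip(self.weights, self.buffer))
```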
The LibriSpeech test audio used here is a mixture of samples from test-clean and test-other.
The current Voice2Face teacher model differs slightly from the original version: the MFCC and SSC features are replaced with HuBERT features, while the model architecture and training framework remain the same. See our paper for full details.
CodeTalker predicts mesh data. We use PCA to reduce the mesh dimension to 50 (99.9% of the variance retained), and the last layer of our student model is adapted accordingly. See our paper for full details.
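A hedged sketch of this mesh-space adaptation: flatten the per-frame vertex positions, fit a 50-component PCA, and use the resulting coefficients as the student's regression targets. scikit-learn is an assumption; the paper does not prescribe a particular implementation.

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_mesh_pca(meshes: np.ndarray, n_components: int = 50) -> PCA:
    """meshes: (frames, vertices, 3); fits a PCA retaining ~99.9% of the variance."""
    flat = meshes.reshape(meshes.shape[0], -1)
    return PCA(n_components=n_components).fit(flat)

def mesh_targets(pca: PCA, meshes: np.ndarray) -> np.ndarray:
    """Project teacher meshes to 50-D PCA coefficients used as student targets."""
    return pca.transform(meshes.reshape(meshes.shape[0], -1))
```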
@article{Tiny_sig25,
author = {Han, Zhen and Teye, Mattias and Yadgaroff, Derek and Bütepage, Judith},
title = {Tiny is not small enough: High quality, low-resource facial animation through hybrid knowledge distillation},
year = {2025},
publisher = {Association for Computing Machinery},
volume = {44},
number = {4},
url = {https://doi.org/10.1145/3730929},
doi = {10.1145/3730929},
journal = {ACM Trans. Graph.},
month = {Aug},
articleno = {104}
}