TINY IS NOT SMALL ENOUGH:

High quality, low-resource facial animation
models through hybrid knowledge distillation

Zhen Han, Mattias Teye, Derek Yadgaroff, Judith Bütepage
SEED, Electronic Arts, Stockholm, Sweden
ACM Transactions on Graphics 2025

Real-time demo

Abstract

The training of high-quality, robust machine learning models for speech-driven 3D facial animation requires a large, diverse dataset of high-quality audio-animation pairs. To overcome the lack of such a dataset, recent work has introduced large pre-trained speech encoders that are robust to variations in the input audio and, therefore, enable the facial animation model to generalize across speakers, audio quality, and languages. However, the resulting facial animation models are prohibitively large and lend themselves only to offline inference on a dedicated machine. In this work, we explore on-device, real-time facial animation models in the context of game development. We overcome the lack of large datasets by using hybrid knowledge distillation with pseudo-labeling. Given a large audio dataset, we employ a high-performing teacher model to train very small student models. In contrast to the pre-trained speech encoders, our student models only consist of convolutional and fully-connected layers, removing the need for attention context or recurrent updates. In our experiments, we demonstrate that we can reduce the memory footprint down to 3.4 MB and the required future audio context down to 81 ms while maintaining high-quality animations. This paves the way for on-device inference, an important step towards realistic, model-driven digital characters.

Video Presentation

Pipeline

Overview

Our knowledge distillation framework operates in two stages: heterogeneous distillation and hybrid distillation. In the heterogeneous stage, we distill a large transformer-based facial animation model into a compact model composed solely of convolutional and fully connected layers. In the hybrid stage, the previously trained compact model is frozen and used as a second teacher. A tiny model and a low-latency model, both sharing a convolutional architecture similar to the compact model, are supervised by these two teachers.


Stage 1: Heterogeneous distillation - small student

We use the Voice2Face model as an example teacher. It consists of three components: a HuBERT encoder, a mesh generator, and a Mesh2Rig model. It processes the entire input sentence (usually > 5 s) to produce rig parameters for each frame. In contrast, the small student model takes raw audio waveforms directly as input and generates rig parameters frame by frame. The input corresponding to each frame is a 512 ms window, spanning 256 ms into the past and 256 ms into the future.
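The sketch below illustrates how such a per-frame window can be sliced from the raw waveform. It assumes 16 kHz mono audio and a 30 FPS animation frame rate; the constant and function names are illustrative and not taken from our code.

    import numpy as np

    SAMPLE_RATE = 16_000                  # assumed audio sample rate
    FPS = 30                              # assumed animation frame rate
    PAST = 256 * SAMPLE_RATE // 1000      # 256 ms of past context, in samples
    FUTURE = 256 * SAMPLE_RATE // 1000    # 256 ms of future context, in samples

    def frame_window(audio: np.ndarray, frame_idx: int) -> np.ndarray:
        """Return the 512 ms waveform window centered on animation frame `frame_idx`."""
        center = int(round(frame_idx / FPS * SAMPLE_RATE))
        start, stop = center - PAST, center + FUTURE
        # Zero-pad at the sequence boundaries so every frame gets a full-length window.
        pad_left, pad_right = max(0, -start), max(0, stop - len(audio))
        padded = np.pad(audio, (pad_left, pad_right))
        return padded[start + pad_left : stop + pad_left]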


Dataset: LibriSpeech 960 hours

The significant architectural difference between the teacher and the small student makes an intermediate feature loss impractical. Hence, we only apply a label loss between the student's predicted rig parameters and the pseudo-labels generated by the teacher.
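A minimal training-step sketch of this label-only distillation is shown below. Here `teacher` and `student` are hypothetical PyTorch modules standing in for the Voice2Face teacher and the small student, and MSE is used as an illustrative choice of label loss.

    import torch
    import torch.nn.functional as F

    def stage1_step(teacher, student, audio_sentence, audio_windows, optimizer):
        # The teacher consumes the full sentence and produces per-frame rig
        # parameters, which serve as pseudo-labels (no ground-truth animation needed).
        with torch.no_grad():
            pseudo_rigs = teacher(audio_sentence)       # (num_frames, rig_dim)

        # The student maps each 512 ms audio window to one frame of rig parameters.
        pred_rigs = student(audio_windows)              # (num_frames, rig_dim)

        loss = F.mse_loss(pred_rigs, pseudo_rigs)       # label loss only
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()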

Stage 2: Hybrid distillation - on-device & real-time student

We design two variants based on the small student model. To enable on-device deployment with low memory usage, we reduce the number of channels by 75%. For a real-time version, we reduce the future context in the input window while keeping the total window length unchanged.


We could train these two variants in the same way as the small student. However, since the small student and its two variants share similar architectures, we additionally apply a homogeneous feature loss between them on top of the label loss described above.
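The hybrid objective can be sketched as follows, continuing the hypothetical PyTorch setup from above: the frozen small student acts as a feature teacher, the original teacher still provides the pseudo-labels, and the MSE losses as well as the weight `lambda_feat` are illustrative choices rather than the exact ones used in the paper.

    import torch
    import torch.nn.functional as F

    def stage2_loss(teacher, small_student, variant, audio_sentence, audio_windows,
                    lambda_feat=1.0):
        with torch.no_grad():
            pseudo_rigs = teacher(audio_sentence)                     # label teacher
            _, teacher_feats = small_student(audio_windows,
                                             return_features=True)    # frozen feature teacher

        pred_rigs, student_feats = variant(audio_windows, return_features=True)

        label_loss = F.mse_loss(pred_rigs, pseudo_rigs)
        feature_loss = F.mse_loss(student_feats, teacher_feats)       # homogeneous feature loss
        return label_loss + lambda_feat * feature_loss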

Post-processing: real-time ensemble prediction

We observed significant jitter for the real-time variant. To reduce it, we design an ensemble prediction method tailored for real-time systems. For each frame at time point \( t \), the smoothed rig parameters are given by a weighted sum: \[ \hat{\mathbf{r}}_t^{\text{smooth}} = \alpha_1\hat{\mathbf{r}}_{t-16.7\,\text{ms}}+\alpha_2\hat{\mathbf{r}}_{t}+\alpha_3\hat{\mathbf{r}}_{t+16.7\,\text{ms}}. \] We choose \(\alpha_1=\alpha_2=\alpha_3=\frac{1}{3}\). This results in an average of rig predictions from three consecutive frames generated at 60 FPS, while the animation can still be rendered at 30 FPS. Notably, this smoothed prediction adds 16.7 ms of latency. Since inference is run twice as often, compute roughly doubles, while peak memory consumption stays constant.
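A minimal sketch of this smoothing is given below: the real-time model is queried at 60 FPS, each output frame averages the previous, current, and next prediction, and rendering consumes every second smoothed frame to stay at 30 FPS. The class and method names are illustrative.

    from collections import deque
    import numpy as np

    class EnsembleSmoother:
        def __init__(self, weights=(1/3, 1/3, 1/3)):
            self.weights = weights
            self.buffer = deque(maxlen=3)   # r_{t-16.7ms}, r_t, r_{t+16.7ms}

        def push(self, rig_prediction: np.ndarray):
            """Feed one 60 FPS rig prediction; returns a smoothed frame once the
            next prediction is available (the source of the 16.7 ms latency)."""
            self.buffer.append(rig_prediction)
            if len(self.buffer) < 3:
                return None
            return sum(w * r for w, r in zip(self.weights, self.buffer))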

Full videos

The LibriSpeech test audio used here is a mixture of samples from test-clean and test-other.

Voice2Face - Teacher

Voice2Face - Small student

Voice2Face - On-device student

Voice2Face - Real-time student

Voice2Face - Real-time student (smoothed)

CodeTalker - Teacher (left) & Small student (right)

Related links

Voice2Face (SEED, Electronic Arts)

The current Voice2Face teacher model differs slightly from the original version: the MFCC and SSC features are replaced with HuBERT features, while the model architecture and training framework remain the same. See full details in our paper.

CodeTalker (CVPR 2023)

CodeTalker predicts mesh data. We use PCA to reduce the mesh dimension to 50 (retaining 99.9% of the variance), and the last layer of our student model is adapted accordingly. See full details in our paper.
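A sketch of this reduction, assuming the training meshes are stored as (num_frames, num_vertices * 3) arrays and using scikit-learn's PCA (the file name is hypothetical):

    import numpy as np
    from sklearn.decomposition import PCA

    meshes = np.load("codetalker_train_meshes.npy")   # hypothetical file
    meshes = meshes.reshape(len(meshes), -1)          # flatten vertices per frame

    pca = PCA(n_components=50)                        # retains ~99.9% of the variance here
    codes = pca.fit_transform(meshes)                 # 50-D targets for the student

    # At inference time, the student's 50-D output is mapped back to a full mesh:
    # mesh = pca.inverse_transform(student_output)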

BibTeX


        @article{Tiny_sig25,
          author = {Han, Zhen and Teye, Mattias and Yadgaroff, Derek and Bütepage, Judith},
          title = {Tiny is not small enough: High quality, low-resource facial animation models through hybrid knowledge distillation},
          year = {2025},
          publisher = {Association for Computing Machinery},
          volume = {44},
          number = {4},
          url = {https://doi.org/10.1145/3730929},
          doi = {10.1145/3730929},
          journal = {ACM Trans. Graph.},
          month = {Aug},
          articleno = {104}
        }