VQTalker: Towards Multilingual Talking Avatars through Facial Motion Tokenization

Demos | Repo


Audio-driven Talking Face Generation (Multi-lingual)

Note:

  • This section demonstrates the robustness of our algorithm across multiple languages.
  • The left side shows the portrait being driven, while the right side displays the results generated by our model, with poses taken from random videos.
  • All faces in this section are fictional and do not exist; the portrait images were randomly sampled from the thispersondoesnotexist website, to which we express our special thanks.

English

Audio samples are from MEAD


Below are generalization results for unseen languages:

Amharic

Audio samples are from ASED


Cantonese

Audio samples are from CosyVoice


French

Audio samples are from CaFE


German

Audio samples are from EmoDB


Italian

Audio samples are from Emozionalmente


Japanese

Audio samples are from CosyVoice


Korean

Audio samples are from CosyVoice


Mandarin

Audio samples are from Aishell1


Turkish

Audio samples are from TurEV-DB


Urdu

Audio samples are from the URDU dataset

Audio-driven Method Comparison

Note:

  • This section compares our method with other methods in audio-driven scenarios.
  • The source portrait is the frame being driven, and the generated results are shown below it.
  • For algorithms that can control pose, blinking, etc., the control signals come from the source video.

Arabic (MNTE dataset)

Source Portrait:

Portrait Image

Result:

SadTalker EAT PD-FGC AniTalker EDTalker EchoMimic VQTalker (Ours)


Japanese (MNTE dataset)

Source Portrait:

Portrait Image

Result:

SadTalker EAT PD-FGC AniTalker EDTalker EchoMimic VQTalker (Ours)


Korean (MNTE dataset)

Source Portrait:

Portrait Image

Result:

SadTalker EAT PD-FGC AniTalker EDTalker EchoMimic VQTalker (Ours)


Mandarin (MNTE dataset)

Source Portrait:

Portrait Image

Result:

SadTalker EAT PD-FGC AniTalker EDTalker EchoMimic VQTalker (Ours)


Swahili (MNTE dataset)

Source Portrait:

Portrait Image

Result:

SadTalker EAT PD-FGC AniTalker EDTalker EchoMimic VQTalker (Ours)


Turkish (MNTE dataset)

Source Portrait:

Portrait Image

Result:

SadTalker EAT PD-FGC AniTalker EDTalker EchoMimic VQTalker (Ours)


Video Reconstruction

Note:

  • This section demonstrates the reconstruction capability of our face tokenizer.
  • We use the first frame of the video and the features extracted from subsequent frames to reconstruct the entire video.
  • These videos highlight our compact and efficient features, which have the lowest bitrate among these methods, approximately 11 kbps; a back-of-envelope estimate is sketched below.
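
For intuition on how a tokenized motion stream stays in the kbps range, here is a rough calculation. All numbers in it are placeholder values for illustration only; the actual ~11 kbps figure follows from the tokenizer configuration reported in the paper (groups, residual levels, and quantization levels), not from this sketch.

```python
import math

def motion_bitrate_kbps(fps, tokens_per_frame, codebook_size):
    """Bitrate of a fixed-rate token stream: each frame sends a fixed number of
    code indices, and each index costs log2(codebook_size) bits."""
    bits_per_frame = tokens_per_frame * math.log2(codebook_size)
    return fps * bits_per_frame / 1000

# Placeholder values, chosen only to show the order of magnitude.
print(f"{motion_bitrate_kbps(fps=25, tokens_per_frame=64, codebook_size=128):.1f} kbps")
```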

Video 1:

H.264
(347 kbps)
FOMM
(48 kbps)
DPE
(16 kbps)
MTIA
(48 kbps)
Vid2Vid
(36 kbps)
LIA
(16 kbps)
FADM
(36 kbps)
AniTalker
(16 kbps)
LivePortrait
(50 kbps)
FaceTokenizer
(Ours, 11 kbps)

Video 2:

H.264
(347 kbps)
FOMM
(48 kbps)
DPE
(16 kbps)
MTIA
(48 kbps)
Vid2Vid
(36 kbps)
LIA
(16 kbps)
FADM
(36 kbps)
AniTalker
(16 kbps)
LivePortrait
(50 kbps)
FaceTokenizer
(Ours, 11 kbps)

Video 3:

H.264
(347 kbps)
FOMM
(48 kbps)
DPE
(16 kbps)
MTIA
(48 kbps)
Vid2Vid
(36 kbps)
LIA
(16 kbps)
FADM
(36 kbps)
AniTalker
(16 kbps)
LivePortrait
(50 kbps)
FaceTokenizer
(Ours, 11 kbps)

Video 4:

H.264
(347 kbps)
FOMM
(48 kbps)
DPE
(16 kbps)
MTIA
(48 kbps)
Vid2Vid
(36 kbps)
LIA
(16 kbps)
FADM
(36 kbps)
AniTalker
(16 kbps)
LivePortrait
(50 kbps)
FaceTokenizer
(Ours, 11 kbps)


Coarse-to-fine Visualization

Note:

  • This section demonstrates the modeling capability of the different residual codebook levels by masking some of them.
  • We use the first frame of the video and the VQ features extracted from subsequent frames to reconstruct the entire video.
  • These videos highlight how the residual VQ builds up the motion from coarse to fine granularity.

Observing the progression, we find that the first codebook level captures only coarse-grained features such as head pose, and does so incompletely. Adding the second residual codebook level captures finer details such as eye blinks and lip movements. The third codebook level achieves a near-complete reconstruction of the original video, though some jitter remains. Finally, incorporating all codebook levels (the complete set) eliminates the remaining jitter, producing a smooth and accurate result. This progression underscores the hierarchical nature of our residual codebook approach, showing how each level contributes to an increasingly refined and faithful video reconstruction.
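
As a rough illustration of why masking the later levels behaves this way, here is a toy sketch of residual quantization in which each level rounds the remaining residual to a progressively finer grid. It is only a simplified stand-in for the grouped residual quantizer used in the paper; the step sizes and feature dimension are made up.

```python
import numpy as np

def residual_quantize(x, steps):
    """Quantize x level by level: each level rounds the remaining residual to a
    grid of the given step size and passes what is left to the next level."""
    residual = x
    levels = []
    for step in steps:                       # coarse grid first, finer grids later
        q = np.round(residual / step) * step
        levels.append(q)
        residual = residual - q              # later levels only model what is still missing
    return levels

def reconstruct(levels, keep):
    """Sum the first `keep` levels, i.e. mask the remaining (finer) ones."""
    return np.sum(levels[:keep], axis=0)

# Toy example: an 8-dimensional "motion feature" and 4 residual levels.
rng = np.random.default_rng(0)
x = rng.normal(size=8)
steps = [1.0, 0.5, 0.25, 0.125]              # hypothetical, progressively finer grids
levels = residual_quantize(x, steps)
for k in range(1, len(steps) + 1):
    err = np.linalg.norm(x - reconstruct(levels, keep=k))
    print(f"levels kept: {k}  reconstruction error: {err:.4f}")
```

Keeping only the first level corresponds to case (1) above, and keeping all four corresponds to case (4): the reconstruction error shrinks as each additional level refines what the previous ones left behind.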

(1) only the first codebook level, masking the subsequent three:

Source
Driven Feature
(Providing VQ features for driving)
Result

(2) the first two codebook levels, masking the latter two:

Source
Driven Feature
(Providing VQ features for driving)
Result

(3) the first three codebook levels, masking only the final one:

Source
Driven Feature
(Providing VQ features for driving)
Result

(4) all codebook levels:

Source
Driven Feature
(Providing VQ features for driving)
Result

Codebook Ablation

Note:

  • We tried different VQ methods as the bottleneck structure of our face tokenizer (a toy FSQ sketch follows these notes).
  • The order of the methods below is consistent with the order in the paper's chart.
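
For readers unfamiliar with the acronyms: FSQ (finite scalar quantization) rounds each latent dimension to a small, fixed set of levels instead of looking up a learned codebook, and GRFSQ, as the name suggests, groups the latent and stacks such quantizers residually. Below is a toy sketch of the FSQ rounding step alone, with the latent dimension and level counts chosen arbitrarily; it is not the paper's implementation.

```python
import numpy as np

def fsq(z, levels):
    """Finite scalar quantization: bound each dimension with tanh, then round it
    to one of `levels[d]` positions. Returns the quantized latent and a token index."""
    levels = np.asarray(levels)
    half = (levels - 1) / 2                  # odd level counts keep the rounding symmetric
    quantized = np.round(np.tanh(z) * half)  # each dim now takes one of `levels[d]` values
    # flatten the per-dimension integers into a single mixed-radix token index
    digits = (quantized + half).astype(int)
    index = 0
    for d, L in zip(digits, levels):
        index = index * int(L) + int(d)
    return quantized / half, index           # rescale to [-1, 1] for the decoder

# Toy example: a 5-dimensional latent with [7, 7, 7, 5, 5] levels per dimension.
rng = np.random.default_rng(0)
z = rng.normal(size=5)
z_hat, token = fsq(z, [7, 7, 7, 5, 5])
print("quantized latent:", np.round(z_hat, 3))
print("token index:", token, "out of", 7 * 7 * 7 * 5 * 5, "possible codes")
```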

Video 1:

Driven Video
VQ
GVQ
RVQ
GRVQ
GRFSQ #1
GRFSQ #2
GRFSQ #3
GRFSQ #4
(Ours)

Video 2:

Driven Video
VQ
GVQ
RVQ
GRVQ
GRFSQ #1
GRFSQ #2
GRFSQ #3
GRFSQ #4
(Ours)

Video 3:

Driven Video
VQ
GVQ
RVQ
GRVQ
GRFSQ #1
GRFSQ #2
GRFSQ #3
GRFSQ #4
(Ours)

Video 4:

Driven Video
VQ
GVQ
RVQ
GRVQ
GRFSQ #1
GRFSQ #2
GRFSQ #3
GRFSQ #4
(Ours)

Video 5 (Cross Video Driven):

Driven Video
VQ
GVQ
RVQ
GRVQ
GRFSQ #1
GRFSQ #2
GRFSQ #3
GRFSQ #4
(Ours)

Discrete vs. Continuous Representation

Note:

  • We tested different types of features as input and output.
  • C denotes continuous features and D denotes discrete features.
  • The configurations of the methods here are consistent with the order of the table in the paper.

Video 1:

C-to-C
(Whisper Continuous Vector to Continuous Vector)
D-to-C
(CosyVoice Speech Tokens to Continuous Vector)
C-to-D
(Whisper Continuous Vector to Discrete Vector)
D-to-D
(VQ-wav2vec Speech Tokens to Discrete Vector)
D-to-D
(CosyVoice Speech Tokens to Discrete Vector)
(Ours)

Video 2:

C-to-C
(Whisper Continuous Vector to Continuous Vector)
D-to-C
(CosyVoice Speech Tokens to Continuous Vector)
C-to-D
(Whisper Continuous Vector to Discrete Vector)
D-to-D
(VQ-wav2vec Speech Tokens to Discrete Vector)
D-to-D
(CosyVoice Speech Tokens to Discrete Vector)
(Ours)

Video 3:

C-to-C
(Whisper Continuous Vector to Continuous Vector)
D-to-C
(CosyVoice Speech Tokens to Continuous Vector)
C-to-D
(Whisper Continuous Vector to Discrete Vector)
D-to-D
(VQ-wav2vec Speech Tokens to Discrete Vector)
D-to-D
(CosyVoice Speech Tokens to Discrete Vector)
(Ours)

Video 4:

C-to-C
(Whisper Continuous Vector to Continuous Vector)
D-to-C
(CosyVoice Speech Tokens to Continuous Vector)
C-to-D
(Whisper Continuous Vector to Discrete Vector)
D-to-D
(VQ-wav2vec Speech Tokens to Discrete Vector)
D-to-D
(CosyVoice Speech Tokens to Discrete Vector)
(Ours)

MNTE Evaluation Dataset

Due to the size limit on supplementary materials, this directory contains only a portion of the Multilingual Non-Indo-European Talking Head Evaluation Corpus (MNTE). We will, at a minimum, publish the sources of the videos so that you can download them yourself; once we have confirmed there are no copyright issues, we will release the full dataset for public testing.


Arabic
Japanese
Korean
Mandarin
Swahili
Turkish

Ethical Considerations

The rapid advancement of digital human technology, particularly in the creation of highly realistic virtual faces, presents significant ethical challenges. There are genuine concerns about the potential misuse of this technology for malicious purposes, such as deepfakes, identity theft, or the propagation of misinformation. To address these issues, it is crucial that developers and organizations establish comprehensive ethical guidelines before deploying such technologies. These guidelines should encompass principles of user privacy, data protection, and responsible use. Furthermore, to enhance accountability and prevent misuse, it is recommended to implement robust verification systems and content attribution methods for all digitally generated human representations. This could include blockchain-based authentication or secure metadata tagging. By proactively addressing these ethical considerations, we can foster the positive potential of digital human technology while minimizing its risks to individuals and society.