Face-GAN-TTS: An Adversarial-Diffusion Framework for Generating High-Quality Voices from Faces
Deborah Guaia
Cognitive Modeling Group, Wilhelm‑Schickard Institute for Computer Science, University of Tübingen
Reviewers: Prof. Dr. Martin Butz | Prof. Dr. Felix Wichmann
Supervisor: Dr. Christian Gumbsch
Abstract
Face-conditioned text-to-speech (TTS) models open new perspectives for cognitive science experiments, but existing systems suffer from audible noise and limited quality. While the diffusion model FACE-TTS produces voice-specific prosody from a portrait and text, it also introduces noisy artifacts that complicate fine-grained perception studies.
This thesis addresses these limitations by improving FACE-TTS with an adversarial approach. To this end, Face-GAN-TTS is presented: a system that optimizes the baseline model with a spectral discriminator. To assess the quality of the generated voices, a listening study and voice analyses are conducted. Training uses 7,358 labeled speakers from the LRS2 dataset; evaluation is carried out on LRS2 and on out-of-domain faces from the Chicago Face Database (CFD).
Forty-five participants each listened to ten audio samples from Face-GAN-TTS and FACE-TTS. The mean opinion score (MOS) for perceived quality is statistically significantly higher for Face-GAN-TTS than for the baseline, rising from 2.88 to 3.19 on LRS2 and from 2.30 to 2.56 on CFD faces. At the same time, perceived "scratchiness" decreases statistically significantly by about 0.7 MOS points in both domains. Overall, Face-GAN-TTS reduces noise and increases voice quality.
However, ablation studies suggest that the underlying cross-modal biometric model uses background noise as an identity cue, so the learned speaker embeddings may not genuinely match face and voice. Nevertheless, Face-GAN-TTS produces clear, low-noise voices from only a face and text. This serves as a basis for low-effort perception studies and multimodal research.
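To make the evaluation above concrete, the following is a minimal sketch of how such MOS ratings can be aggregated and compared. The random placeholder ratings and the two-sided independent-samples t-test are assumptions for illustration, not the exact analysis pipeline used in the thesis.

```python
# Minimal sketch of a MOS comparison: assumed data layout
# (45 participants x 10 samples per system) and an assumed
# two-sided t-test; the thesis's actual statistics may differ.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder ratings on the 1-5 MOS scale; in practice these
# would be loaded from the listening-study results.
ratings_baseline = rng.integers(1, 6, size=(45, 10))  # FACE-TTS
ratings_gan = rng.integers(1, 6, size=(45, 10))       # Face-GAN-TTS

# Average each participant's ratings first, so that the
# observations entering the test are independent per participant.
mos_baseline = ratings_baseline.mean(axis=1)
mos_gan = ratings_gan.mean(axis=1)

t, p = stats.ttest_ind(mos_gan, mos_baseline)
print(f"MOS baseline={mos_baseline.mean():.2f}, "
      f"Face-GAN-TTS={mos_gan.mean():.2f}, p={p:.4f}")
```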
Ablation Studies
- (a) Ground Truth: Reference recording of natural human speech.
- (b) FACE-GAN-TTS: Generator pretrained on LRS3; full GAN setup finetuned on LRS2 with adversarial learning (learning rate 1e-8; see the training sketch after this list).
- (c) FACE-TTS Scratch: Baseline FACE-TTS (Lee et al., 2023) trained from scratch on LRS2, no adversarial learning (adversarial-loss weight 0.7, learning rate 1e-4).
- (d) FACE-TTS Finetuned: Baseline FACE-TTS pretrained on LRS3 and finetuned on LRS2 without GAN (learning rate 1e-8).
- (e) FACE-TTS LRS3-only: Published baseline FACE-TTS by Lee et al. (2023) trained exclusively on LRS3 (learning rate 1e-4, speaker weight γ = 0.01).
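The sketch below illustrates how adversarial fine-tuning with a spectral discriminator, as in configuration (b), might look in PyTorch. Both module definitions are toy stand-ins rather than the actual FACE-TTS code, and the least-squares GAN objective and the adversarial weight are assumptions; only the 1e-8 fine-tuning learning rate is taken from the configurations above.

```python
# Illustrative sketch of adversarial fine-tuning with a spectral
# discriminator. The modules are toy stand-ins for the real
# FACE-TTS generator and the thesis's discriminator; only the
# 1e-8 learning rate comes from configuration (b) above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyGenerator(nn.Module):
    """Stand-in for the pretrained FACE-TTS generator (assumption)."""
    def __init__(self, n_mels=80):
        super().__init__()
        self.proj = nn.Linear(512, n_mels)
    def forward(self, face_emb, frames=100):
        # The real model is a diffusion decoder conditioned on
        # text and a face embedding; this just emits a mel tensor.
        return self.proj(face_emb).unsqueeze(-1).repeat(1, 1, frames)

class SpectralDiscriminator(nn.Module):
    """Stand-in: per-frame realness scores on mel spectrograms."""
    def __init__(self, n_mels=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 64, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv1d(64, 1, 3, padding=1))
    def forward(self, mel):
        return self.net(mel)

gen, disc = ToyGenerator(), SpectralDiscriminator()
opt_g = torch.optim.Adam(gen.parameters(), lr=1e-8)  # from config (b)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-8)
lambda_adv = 1.0  # assumed weighting of the adversarial term

face_emb = torch.randn(4, 512)      # placeholder face embeddings
mel_real = torch.randn(4, 80, 100)  # placeholder ground-truth mels

# Discriminator step: least-squares GAN objective (assumed variant).
mel_fake = gen(face_emb)
d_real, d_fake = disc(mel_real), disc(mel_fake.detach())
loss_d = (F.mse_loss(d_real, torch.ones_like(d_real)) +
          F.mse_loss(d_fake, torch.zeros_like(d_fake)))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: reconstruction plus adversarial term (the real
# system uses the FACE-TTS diffusion loss, not this L1 placeholder).
d_fake = disc(mel_fake)
loss_g = (F.l1_loss(mel_fake, mel_real) +
          lambda_adv * F.mse_loss(d_fake, torch.ones_like(d_fake)))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

In the real system the generator's reconstruction term is the FACE-TTS diffusion objective; the L1 loss above merely stands in so the sketch runs end to end.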
Examples
(Audio examples for the conditions above are available on the demo page.)
Ethical Considerations
- Privacy & Consent: All faces originate from publicly available research datasets that permit demonstration and research use.
- Deepfake Risks: The technology could be misused for deception. Every generated voice is watermarked and is not intended to impersonate real people.
- Bias & Fairness: We evaluate the model on the Chicago Face Database to expose potential biases with respect to gender, age, and ethnicity.
- Responsible Release: Audio and image data are provided solely for scientific reproduction; no commercial use is permitted without explicit permission.
Referenced Projects & Citations
- Lee, J., Chung, J. S., and Chung, S.-W. (2023). Imaginary voice: Face-styled diffusion model for text-to-speech. In ICASSP 2023.
- Kong, J., Kim, J., and Bae, J. (2020). HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. Advances in Neural Information Processing Systems, 33, 17022–17033.
- Ko, M., Kim, E., and Choi, Y.-H. (2024). Adversarial training of denoising diffusion model using dual discriminators for high-fidelity multi-speaker TTS. IEEE Open Journal of Signal Processing, 5, 577–587.
- Ma, D. S., Correll, J., and Wittenbrink, B. (2015). The Chicago Face Database: A free stimulus set of faces and norming data. Behavior Research Methods, 47, 1122–1135.
- Afouras, T., Chung, J. S., Senior, A. W., Vinyals, O., and Zisserman, A. (2018). Deep audio-visual speech recognition. CoRR, abs/1809.02108.
The full source code for this demo is available on GitHub.