Face-GAN-TTS: An Adversarial-Diffusion Framework for Generating High-Quality Voices from Faces

Deborah Guaia

Cognitive Modeling Group, Wilhelm‑Schickard Institute for Computer Science, University of Tübingen

Reviewers: Prof. Dr. Martin Butz  |  Prof. Dr. Felix Wichmann

Supervisor: Dr. Christian Gumbsch

Abstract

Face-conditioned text-to-speech (TTS) models open new perspectives for cognitive-science experiments, but existing systems suffer from audible noise and limited quality. While the diffusion-based model FACE-TTS produces speaker-specific prosody from a portrait and text, it also introduces noisy artifacts that complicate fine-grained perception studies.

This thesis addresses these limitations by improving FACE-TTS with an adversarial approach. To this end, Face-GAN-TTS is presented, a system that refines the baseline model using a spectral discriminator. To assess the quality of the generated voices, a listening study and voice analyses are conducted. Training uses 7,358 labeled speakers from the LRS2 dataset; evaluation is carried out on LRS2 and on out-of-domain faces from the Chicago Face Database (CFD).
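To make the adversarial setup concrete, the sketch below shows one way a spectral discriminator over mel-spectrograms and its losses could be implemented in PyTorch. The patch-based architecture and the least-squares GAN loss are illustrative assumptions, not the exact thesis implementation.

```python
import torch
import torch.nn as nn

class SpectralDiscriminator(nn.Module):
    """CNN that scores mel-spectrogram patches as real or generated.
    Layer sizes are illustrative assumptions, not the thesis architecture."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, kernel_size=3, stride=1, padding=1),  # patch logits
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames) -> add a channel dimension for Conv2d
        return self.net(mel.unsqueeze(1))

def adversarial_losses(disc, real_mel, fake_mel):
    """Least-squares GAN losses (the loss type is an assumption here)."""
    d_loss = ((disc(real_mel) - 1) ** 2).mean() + (disc(fake_mel.detach()) ** 2).mean()
    g_loss = ((disc(fake_mel) - 1) ** 2).mean()
    return d_loss, g_loss
```

During finetuning, the diffusion generator would then minimize its usual diffusion loss plus a weighted g_loss, while the discriminator minimizes d_loss on real and generated mel-spectrograms.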

Forty-five participants each listened to ten audio samples from Face-GAN-TTS and FACE-TTS. The mean opinion score (MOS) for perceived quality shows a statistically significant increase for Face-GAN-TTS over the baseline, rising from 2.88 to 3.19 on LRS2 and from 2.30 to 2.56 on CFD faces. At the same time, perceived "scratchiness" decreases by a statistically significant margin of about 0.7 MOS points in both domains. Overall, Face-GAN-TTS reduces noise and improves voice quality.
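As a reproducibility aid, the following sketch shows how MOS values and a paired significance test could be computed from raw ratings. The Wilcoxon signed-rank test is an assumption for illustration; the thesis may use a different test.

```python
import numpy as np
from scipy import stats

def mos(ratings: np.ndarray) -> float:
    """Mean opinion score: average over all 1-5 ratings (participants x items)."""
    return float(ratings.mean())

def compare_systems(baseline: np.ndarray, proposed: np.ndarray):
    """Paired test on per-participant mean ratings for two systems.
    Arrays have shape (participants, items), e.g. (45, 10)."""
    stat, p = stats.wilcoxon(proposed.mean(axis=1), baseline.mean(axis=1))
    return mos(baseline), mos(proposed), p
```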

However, ablation studies suggest that the underlying cross-modal biometric model uses background noise as an identity cue, so the speaker embeddings may not faithfully link face and voice. Nevertheless, Face-GAN-TTS produces clear, low-noise voices from faces and text alone. This serves as a basis for low-effort perception studies and multimodal research.
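One simple way to probe such noise sensitivity is to check how much a speaker embedding changes when artificial background noise is added to otherwise identical audio. The sketch below assumes a hypothetical embed function standing in for the cross-modal speaker encoder.

```python
import torch
import torch.nn.functional as F

def noise_sensitivity(embed, wav: torch.Tensor, snr_db: float = 10.0) -> float:
    """Cosine similarity between embeddings of a clean waveform and the same
    waveform with added Gaussian noise at the given SNR. A large drop suggests
    the encoder keys on background conditions rather than speaker identity.
    `embed` is a hypothetical stand-in for the cross-modal speaker encoder."""
    noise_power = wav.pow(2).mean() / (10 ** (snr_db / 10))
    noisy = wav + torch.randn_like(wav) * noise_power.sqrt()
    return F.cosine_similarity(embed(wav), embed(noisy), dim=-1).item()
```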

Ablation Studies

Figure 1: Mel-spectrogram comparison – (a) Ground Truth, (b) Face-GAN-TTS, (c) FACE-TTS Scratch, (d) FACE-TTS Finetuned, (e) FACE-TTS LRS3-only.
  • (a) Ground Truth: Reference recording of natural human speech.
  • (b) Face-GAN-TTS: generator pretrained on LRS3; the full GAN setup finetuned on LRS2 with adversarial learning (adversarial-loss weight 0.7, learning rate 1e-8).
  • (c) FACE-TTS Scratch: baseline FACE-TTS (Lee et al., 2023) trained from scratch on LRS2 without adversarial learning (learning rate 1e-4).
  • (d) FACE-TTS Finetuned: baseline FACE-TTS pretrained on LRS3 and finetuned on LRS2 without a GAN (learning rate 1e-8).
  • (e) FACE-TTS LRS3-only: published baseline FACE-TTS by Lee et al. (2023), trained exclusively on LRS3 (learning rate 1e-4, speaker-loss weight γ = 0.01); all four configurations are also summarized in the sketch after this list.
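For quick reference, the settings above can be collected in a single configuration sketch; the field names are illustrative placeholders, not the actual training scripts.

```python
# Summary of the ablation variants above; field names are placeholders.
ABLATIONS = {
    "b_face_gan_tts": dict(pretrain="LRS3", finetune="LRS2",
                           adversarial=True, adv_weight=0.7, lr=1e-8),
    "c_face_tts_scratch": dict(train="LRS2", adversarial=False, lr=1e-4),
    "d_face_tts_finetuned": dict(pretrain="LRS3", finetune="LRS2",
                                 adversarial=False, lr=1e-8),
    "e_face_tts_lrs3": dict(train="LRS3", adversarial=False,
                            lr=1e-4, speaker_weight_gamma=0.01),
}
```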

Examples

(b) Face-GAN-TTS
(c) FACE-TTS Scratch
(d) FACE-TTS Finetuned
(e) FACE-TTS LRS3-only

Ethical Considerations

Referenced Projects & Citations
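  • Lee, J., Chung, J. S., & Chung, S.-W. (2023). Imaginary Voice: Face-Styled Diffusion Model for Text-to-Speech. Proc. IEEE ICASSP 2023.
  • Afouras, T., Chung, J. S., Senior, A., Vinyals, O., & Zisserman, A. (2018). Deep Audio-Visual Speech Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. (LRS2 dataset)
  • Afouras, T., Chung, J. S., & Zisserman, A. (2018). LRS3-TED: A Large-Scale Dataset for Visual Speech Recognition. arXiv:1809.00496. (LRS3 dataset)
  • Ma, D. S., Correll, J., & Wittenbrink, B. (2015). The Chicago Face Database: A Free Stimulus Set of Faces and Norming Data. Behavior Research Methods, 47(4), 1122–1135.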

The full source code for this demo is available on GitHub.