Face-GAN-TTS: An Adversarial-Diffusion Framework for Generating High-Quality Voices from Faces

Deborah Guaia

Cognitive Modeling Group, Wilhelm‑Schickard Institute for Computer Science, University of Tübingen

Reviewers: Prof. Dr. Martin Butz  |  Prof. Dr. Felix Wichmann

Supervisor: Dr. Christian Gumbsch

Abstract

Face-conditioned text-to-speech (TTS) models open new perspectives for cognitive-science experiments, but existing systems suffer from audible noise and limited quality. While the diffusion-based model FACE-TTS produces speaker-specific prosody from a portrait and text, it also introduces noisy artifacts that complicate fine-grained perception studies.

This thesis addresses these limitations by improving FACE-TTS with an adversarial approach. To this end, Face-GAN-TTS is presented, a system that refines the baseline model using a spectral discriminator. To assess the quality of the generated voices, a listening study and voice analyses are conducted. Training uses 7,358 labeled speakers from the LRS2 dataset; evaluation is carried out on LRS2 and on out-of-domain faces from the Chicago Face Database (CFD).
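To make the adversarial setup concrete, the sketch below shows one way a spectral discriminator over mel-spectrograms and its losses could be implemented in PyTorch. The patch-based architecture and the least-squares GAN loss are illustrative assumptions, not the exact thesis implementation.

```python
import torch
import torch.nn as nn

class SpectralDiscriminator(nn.Module):
    """CNN that scores mel-spectrogram patches as real or generated.
    Layer sizes are illustrative assumptions, not the thesis architecture."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, kernel_size=3, stride=1, padding=1),  # patch logits
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames) -> add a channel dimension for Conv2d
        return self.net(mel.unsqueeze(1))

def adversarial_losses(disc, real_mel, fake_mel):
    """Least-squares GAN losses (the loss type is an assumption here)."""
    d_loss = ((disc(real_mel) - 1) ** 2).mean() + (disc(fake_mel.detach()) ** 2).mean()
    g_loss = ((disc(fake_mel) - 1) ** 2).mean()
    return d_loss, g_loss
```

During finetuning, the diffusion generator would then minimize its usual diffusion loss plus a weighted g_loss, while the discriminator minimizes d_loss on real and generated mel-spectrograms.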

Forty-five participants each listened to ten audio samples from Face-GAN-TTS and FACE-TTS. The mean opinion score (MOS) for perceived quality shows a statistically significant increase for Face-GAN-TTS over the baseline, rising from 2.88 to 3.19 on LRS2 and from 2.30 to 2.56 on CFD faces. At the same time, perceived "scratchiness" decreases by a statistically significant margin of about 0.7 MOS points in both domains. Overall, Face-GAN-TTS reduces noise and improves voice quality.
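As a reproducibility aid, the following sketch shows how MOS values and a paired significance test could be computed from raw ratings. The Wilcoxon signed-rank test is an assumption for illustration; the thesis may use a different test.

```python
import numpy as np
from scipy import stats

def mos(ratings: np.ndarray) -> float:
    """Mean opinion score: average over all 1-5 ratings (participants x items)."""
    return float(ratings.mean())

def compare_systems(baseline: np.ndarray, proposed: np.ndarray):
    """Paired test on per-participant mean ratings for two systems.
    Arrays have shape (participants, items), e.g. (45, 10)."""
    stat, p = stats.wilcoxon(proposed.mean(axis=1), baseline.mean(axis=1))
    return mos(baseline), mos(proposed), p
```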

However, ablation studies suggest that the underlying cross-modal biometric model uses background noise as an identity cue, so the speaker embeddings may not faithfully link face and voice. Nevertheless, Face-GAN-TTS produces clear, low-noise voices from faces and text alone. This serves as a basis for low-effort perception studies and multimodal research.
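One simple way to probe such noise sensitivity is to check how much a speaker embedding changes when artificial background noise is added to otherwise identical audio. The sketch below assumes a hypothetical embed function standing in for the cross-modal speaker encoder.

```python
import torch
import torch.nn.functional as F

def noise_sensitivity(embed, wav: torch.Tensor, snr_db: float = 10.0) -> float:
    """Cosine similarity between embeddings of a clean waveform and the same
    waveform with added Gaussian noise at the given SNR. A large drop suggests
    the encoder keys on background conditions rather than speaker identity.
    `embed` is a hypothetical stand-in for the cross-modal speaker encoder."""
    noise_power = wav.pow(2).mean() / (10 ** (snr_db / 10))
    noisy = wav + torch.randn_like(wav) * noise_power.sqrt()
    return F.cosine_similarity(embed(wav), embed(noisy), dim=-1).item()
```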

Ablation Studies

Figure 1: Mel-spectrogram comparison – (a) Ground Truth, (b) Face-GAN-TTS, (c) FACE-TTS Scratch, (d) FACE-TTS Finetuned, (e) FACE-TTS LRS3-only.
  • (a) Ground Truth: Reference recording of natural human speech.
  • (b) Face-GAN-TTS: generator pretrained on LRS3; the full GAN setup finetuned on LRS2 with adversarial learning (adversarial-loss weight 0.7, learning rate 1e-8).
  • (c) FACE-TTS Scratch: baseline FACE-TTS (Lee et al., 2023) trained from scratch on LRS2 without adversarial learning (learning rate 1e-4).
  • (d) FACE-TTS Finetuned: baseline FACE-TTS pretrained on LRS3 and finetuned on LRS2 without a GAN (learning rate 1e-8).
  • (e) FACE-TTS LRS3-only: published baseline FACE-TTS by Lee et al. (2023), trained exclusively on LRS3 (learning rate 1e-4, speaker-loss weight γ = 0.01); all four configurations are also summarized in the sketch after this list.
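For quick reference, the settings above can be collected in a single configuration sketch; the field names are illustrative placeholders, not the actual training scripts.

```python
# Summary of the ablation variants above; field names are placeholders.
ABLATIONS = {
    "b_face_gan_tts": dict(pretrain="LRS3", finetune="LRS2",
                           adversarial=True, adv_weight=0.7, lr=1e-8),
    "c_face_tts_scratch": dict(train="LRS2", adversarial=False, lr=1e-4),
    "d_face_tts_finetuned": dict(pretrain="LRS3", finetune="LRS2",
                                 adversarial=False, lr=1e-8),
    "e_face_tts_lrs3": dict(train="LRS3", adversarial=False,
                            lr=1e-4, speaker_weight_gamma=0.01),
}
```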

Examples

(b) Face-GAN-TTS
(c) FACE-TTS Scratch
(d) FACE-TTS Finetuned
(e) FACE-TTS LRS3-only

Ethical Considerations

Referenced Projects & Citations
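  • Lee, J., Chung, J. S., & Chung, S.-W. (2023). Imaginary Voice: Face-Styled Diffusion Model for Text-to-Speech. Proc. IEEE ICASSP 2023.
  • Afouras, T., Chung, J. S., Senior, A., Vinyals, O., & Zisserman, A. (2018). Deep Audio-Visual Speech Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. (LRS2 dataset)
  • Afouras, T., Chung, J. S., & Zisserman, A. (2018). LRS3-TED: A Large-Scale Dataset for Visual Speech Recognition. arXiv:1809.00496. (LRS3 dataset)
  • Ma, D. S., Correll, J., & Wittenbrink, B. (2015). The Chicago Face Database: A Free Stimulus Set of Faces and Norming Data. Behavior Research Methods, 47(4), 1122–1135.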

The full source code for this demo is available on GitHub.