Multimodal attention for lip synthesis using conditional generative adversarial networks

Cited: 0
Authors
Vidal, Andrea [1 ]
Busso, Carlos [1 ]
Affiliations
[1] Univ Texas Dallas, Dept Elect & Comp Engn, 800 W Campbell Rd, Richardson, TX 75080 USA
Funding
U.S. National Science Foundation
Keywords
Speech-driven animations; Socially interactive agents; Conditional GAN; Lip movements; Cross-modal attention; Attention mechanism; FACIAL ANIMATION; HEAD MOTION; SPEECH; DRIVEN;
DOI
10.1016/j.specom.2023.102959
CLC classification number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
The synthesis of lip movements is an important problem for a socially interactive agent (SIA). The generated lip movements should be synchronized with speech and show realistic co-articulation. We hypothesize that combining lexical information (i.e., the sequence of phonemes) and acoustic features can lead not only to models that generate lip movements matching the articulatory movements, but also to trajectories that are well synchronized with the speech emphasis and emotional content. This work presents attention-based frameworks that use acoustic and lexical information to enhance the synthesis of lip movements. The lexical information is obtained from automatic speech recognition (ASR) transcriptions, broadening the range of applications of the proposed solution. We propose models based on conditional generative adversarial networks (CGANs) with self-modality and cross-modality attention mechanisms. These models allow us to identify which frames contribute most to the generation of lip movements. We animate the synthesized lip movements using blendshapes. These animations are used to compare our proposed multimodal models with alternative methods, including unimodal models implemented with either text or acoustic features. We rely on subjective metrics based on perceptual evaluations and an objective metric based on the LipSync model. The results show that our proposed models with attention mechanisms are preferred over the baselines in terms of perceived naturalness. The addition of cross-modality and self-modality attention has a significant positive impact on the performance of the generated sequences. We observe that lexical information provides valuable cues even when the transcriptions are not perfect. The improved performance of the multimodal system confirms the complementary information provided by the speech and text modalities.
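The abstract describes a conditional GAN conditioned on fused acoustic and lexical streams, with self-modality and cross-modality attention. The PyTorch sketch below illustrates one plausible form such a fusion module could take; the layer sizes, phoneme-inventory size, and fusion strategy are illustrative assumptions rather than the authors' architecture, and the CGAN generator and discriminator are omitted.

# Minimal sketch (PyTorch) of self- and cross-modal attention fusion of
# acoustic frames and phoneme embeddings. All dimensions and names are
# illustrative assumptions, not the paper's exact model.
import torch
import torch.nn as nn


class CrossModalAttentionFusion(nn.Module):
    """Acoustic frames attend over phoneme (lexical) embeddings before the
    fused features condition a lip-movement generator (not shown)."""

    def __init__(self, acoustic_dim=80, num_phonemes=45, embed_dim=128, num_heads=4):
        super().__init__()
        self.acoustic_proj = nn.Linear(acoustic_dim, embed_dim)      # e.g., mel-spectrogram frames
        self.phoneme_embed = nn.Embedding(num_phonemes, embed_dim)   # phoneme IDs from ASR output
        # Self-modality attention within the acoustic stream
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        # Cross-modality attention: acoustic queries, lexical keys/values
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.out = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, acoustic, phoneme_ids):
        # acoustic: (batch, T_audio, acoustic_dim); phoneme_ids: (batch, T_text)
        a = self.acoustic_proj(acoustic)
        p = self.phoneme_embed(phoneme_ids)
        a_self, _ = self.self_attn(a, a, a)            # which acoustic frames matter most
        a_cross, attn_w = self.cross_attn(a, p, p)     # which phonemes each frame attends to
        fused = self.out(torch.cat([a_self, a_cross], dim=-1))
        return fused, attn_w                           # attn_w can be inspected for interpretability


if __name__ == "__main__":
    model = CrossModalAttentionFusion()
    mel = torch.randn(2, 120, 80)              # two utterances, 120 frames of 80-dim features
    phones = torch.randint(0, 45, (2, 30))     # 30 phoneme tokens per utterance
    fused, weights = model(mel, phones)
    print(fused.shape, weights.shape)          # (2, 120, 128), (2, 120, 30)

The returned attention weights are one way to inspect which frames or phonemes drive each generated lip pose, in line with the interpretability claim in the abstract.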
Pages: 12