Talking Head from Speech Audio using a Pre-trained Image Generator

Cited by: 8
Authors
Alghamdi, Mohammed M. [1, 2]
Wang, He [1]
Bulpitt, Andrew J. [1]
Hogg, David C. [1]
Affiliations
[1] University of Leeds, Leeds, West Yorkshire, England
[2] Taif University, Taif, Saudi Arabia
Keywords
talking head generation; video generation; audio-driven synthesis
DOI
10.1145/3503161.3548101
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
We propose a novel method for generating high-resolution videos of talking heads from speech audio and a single 'identity' image. Our method is based on a convolutional neural network model that incorporates a pre-trained StyleGAN generator. We model each frame as a point in the latent space of StyleGAN, so that a video corresponds to a trajectory through the latent space. The network is trained in two stages. The first stage models trajectories in the latent space conditioned on speech utterances. To do this, we use an existing encoder to invert the generator, mapping each video frame into the latent space. We then train a recurrent neural network to map from speech utterances to displacements in the latent space of the image generator. These displacements are relative to the back-projection into the latent space of an identity image chosen from the individuals depicted in the training dataset. In the second stage, we improve the visual quality of the generated videos by tuning the image generator on a single image or a short video of any chosen identity. We evaluate our model on standard measures (PSNR, SSIM, FID and LMD) and show that it significantly outperforms recent state-of-the-art methods on one of two commonly used datasets and gives comparable performance on the other. Finally, we report on ablation experiments that validate the components of the model. The code and videos from experiments can be found at https://mohammedalghamdi.github.io/talking-heads-acm-mm/
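The abstract describes a video as a trajectory through the generator's latent space, with each frame's latent code expressed as a displacement from the latent code of the identity image. The following is a minimal sketch of that single step only, not the authors' implementation. The function name `latent_trajectory`, the 4-dimensional latent, and the displacement values are all hypothetical; in the described method the displacements would come from the recurrent network and each resulting latent code would be passed to the StyleGAN generator.

```python
from typing import List

Latent = List[float]


def latent_trajectory(identity_latent: Latent,
                      displacements: List[Latent]) -> List[Latent]:
    """Return one latent code per frame: identity latent plus that frame's displacement."""
    trajectory = []
    for d in displacements:
        if len(d) != len(identity_latent):
            raise ValueError("displacement and identity latent must have the same length")
        # Element-wise addition: frame latent = identity latent + displacement.
        trajectory.append([w + dw for w, dw in zip(identity_latent, d)])
    return trajectory


# Hypothetical toy values: a 4-dimensional latent and three frames.
# Real StyleGAN latents are far larger; these numbers are placeholders.
identity = [0.5, -1.0, 2.0, 0.0]
predicted = [
    [0.0, 0.0, 0.0, 0.0],    # zero displacement reproduces the identity latent
    [0.25, 0.0, -0.5, 1.0],
    [0.5, 0.25, -1.0, 2.0],
]
frames = latent_trajectory(identity, predicted)
```

Under these assumptions, a zero displacement yields the identity image's own latent code, which is why the displacements can be learned independently of the particular identity chosen.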
Pages: 5228-5236
Page count: 9