Talking Head from Speech Audio using a Pre-trained Image Generator

Cited by: 8

Authors:

Alghamdi, Mohammed M. [1,2]

Wang, He [1]

Bulpitt, Andrew J. [1]

Hogg, David C. [1]

Affiliations:

[1] University of Leeds, Leeds, West Yorkshire, England

[2] Taif University, Taif, Saudi Arabia

Source:

Proceedings of the 30th ACM International Conference on Multimedia, MM 2022 | 2022

Keywords:

talking head generation; video generation; audio-driven synthesis

DOI:

10.1145/3503161.3548101

Chinese Library Classification (CLC) number:

TP39 [Computer applications]

Discipline classification codes:

081203; 0835

Abstract:

We propose a novel method for generating high-resolution videos of talking heads from speech audio and a single 'identity' image. Our method is based on a convolutional neural network model that incorporates a pre-trained StyleGAN generator. We model each frame as a point in the latent space of StyleGAN, so that a video corresponds to a trajectory through the latent space. The network is trained in two stages. The first stage models trajectories in the latent space conditioned on speech utterances. To do this, we use an existing encoder to invert the generator, mapping each video frame into the latent space. We then train a recurrent neural network to map from speech utterances to displacements in the latent space of the image generator. These displacements are relative to the back-projection into the latent space of an identity image chosen from the individuals depicted in the training dataset. In the second stage, we improve the visual quality of the generated videos by tuning the image generator on a single image or a short video of any chosen identity. We evaluate our model on standard measures (PSNR, SSIM, FID and LMD) and show that it significantly outperforms recent state-of-the-art methods on one of two commonly used datasets and gives comparable performance on the other. Finally, we report on ablation experiments that validate the components of the model. The code and videos from experiments can be found at https://mohammedalghamdi.github.io/talking-heads-acm-mm/
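The idea that a video is a trajectory of latent codes, each expressed as a displacement from the identity image's latent code, can be illustrated with a small sketch. This is not the authors' implementation: the function name `latent_trajectory`, the flat 512-dimensional latent vector, and the random stand-in data are all assumptions made for illustration. In the method described above, the displacements would come from a recurrent network driven by speech, and each resulting code would be passed to the StyleGAN generator to render a frame.

```python
import numpy as np


def latent_trajectory(w_identity, displacements):
    """Return one latent code per frame: identity code plus that frame's displacement.

    w_identity:    array of shape (latent_dim,), the inverted identity image (hypothetical).
    displacements: array of shape (num_frames, latent_dim), per-frame offsets (hypothetical).
    """
    # Broadcasting adds the same identity code to every frame's displacement.
    return w_identity[None, :] + displacements


# Random stand-ins; real values would come from an encoder and a speech-driven model.
rng = np.random.default_rng(0)
latent_dim = 512   # assumed latent size, chosen only for this example
num_frames = 5

w_identity = rng.standard_normal(latent_dim)
displacements = 0.1 * rng.standard_normal((num_frames, latent_dim))

trajectory = latent_trajectory(w_identity, displacements)
print(trajectory.shape)
```

Expressing frames relative to the identity code means a zero displacement reproduces the identity image's latent code, which is one plausible reason for predicting offsets rather than absolute codes.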

Pages: 5228-5236

Number of pages: 9
