Speech2Video: Cross-Modal Distillation for Speech to Video Generation

Cited by: 5

Authors:

Si, Shijing [1]

Wang, Jianzong [1]

Qu, Xiaoyang [1]

Cheng, Ning [1]

Wei, Wenqi [1]

Zhu, Xinghua [1]

Xiao, Jing [1]

Affiliations:

[1] Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, People's Republic of China

Source:

INTERSPEECH 2021 | 2021

Keywords:

Video generation; generative adversarial network; distillation; unsupervised learning; representation learning;

DOI:

10.21437/Interspeech.2021-1996

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology];

Discipline classification code:

100104 ; 100213 ;

Abstract:

This paper investigates a novel task: generating talking-face video solely from speech. Speech-to-video generation could enable applications in entertainment, customer service, and human-computer interaction. The timbre, accent, and speed of speech can carry rich information relevant to a speaker's appearance. The main challenge lies in disentangling the distinct visual attributes from audio signals. We propose a lightweight cross-modal distillation method that extracts disentangled emotional and identity information from unlabelled video inputs. A generative adversarial network then integrates the extracted features into talking-face video clips. With carefully crafted discriminators, the proposed framework achieves realistic generation results. Experiments with observed individuals demonstrate that the framework captures emotional expressions solely from speech and produces spontaneous facial motion in the video output. Compared with the baseline method, in which speech is combined with a static image of the speaker, the results of the proposed framework are almost indistinguishable. User studies also show that the proposed method outperforms existing algorithms in terms of emotion expression in the generated videos.

Pages: 1629-1633

Page count: 5
