Speech2Video: Cross-Modal Distillation for Speech to Video Generation

Cited by: 5
Authors
Si, Shijing [1]
Wang, Jianzong [1]
Qu, Xiaoyang [1]
Cheng, Ning [1]
Wei, Wenqi [1]
Zhu, Xinghua [1]
Xiao, Jing [1]
Affiliation
[1] Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, People's Republic of China
Source
Keywords
Video generation; generative adversarial network; distillation; unsupervised learning; representation learning;
DOI
10.21437/Interspeech.2021-1996
CLC number
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject classification code
100104; 100213;
Abstract
This paper investigates a novel task: generating talking-face video solely from speech. Speech-to-video generation can enable applications in entertainment, customer service, and human-computer interaction. The timbre, accent, and speed of speech can carry rich information about a speaker's appearance. The main challenge lies in disentangling distinct visual attributes from audio signals. We propose a lightweight cross-modal distillation method to extract disentangled emotional and identity information from unlabelled video inputs. A generative adversarial network then integrates the extracted features into talking-face video clips. With carefully crafted discriminators, the proposed framework achieves realistic generation results. Experiments with observed individuals demonstrate that the framework captures emotional expressions solely from speech and produces spontaneous facial motion in the video output. Compared with the baseline method, in which speech is combined with a static image of the speaker, the results of the proposed framework are almost indistinguishable. User studies also show that the proposed method outperforms existing algorithms in terms of emotion expression in the generated videos.
Pages: 1629-1633
Page count: 5