Speech2Video: Cross-Modal Distillation for Speech to Video Generation

Cited by: 5
Authors:
Si, Shijing [1]
Wang, Jianzong [1]
Qu, Xiaoyang [1]
Cheng, Ning [1]
Wei, Wenqi [1]
Zhu, Xinghua [1]
Xiao, Jing [1]
Affiliations:
[1] Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, People's Republic of China
Keywords:
Video generation; generative adversarial network; distillation; unsupervised learning; representation learning
DOI:
10.21437/Interspeech.2021-1996
CLC classification: R36 [Pathology]; R76 [Otorhinolaryngology]
Subject classification codes: 100104; 100213
Abstract:
This paper investigates the novel task of generating talking-face video solely from speech. Speech-to-video generation can spark interesting applications in the entertainment, customer-service, and human-computer-interaction industries. Indeed, the timbre, accent, and speed of speech carry rich information about a speaker's appearance. The challenge lies mainly in disentangling distinct visual attributes from audio signals. In this article, we propose a lightweight, cross-modal distillation method to extract disentangled emotional and identity information from unlabelled video inputs. The extracted features are then integrated by a generative adversarial network into talking-face video clips. With carefully crafted discriminators, the proposed framework achieves realistic generation results. Experiments with observed individuals demonstrate that the proposed framework captures emotional expressions solely from speech and produces spontaneous facial motion in the video output. Compared with the baseline method, in which speech is combined with a static image of the speaker, the results of the proposed framework are almost indistinguishable. User studies also show that the proposed method outperforms existing algorithms in terms of emotion expression in the generated videos.
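The cross-modal distillation idea in the abstract can be sketched very roughly: a frozen "teacher" encoder over video frames provides target embeddings, and a trainable "student" encoder over audio features learns to match them under an MSE distillation loss, so that at inference time visual attributes can be predicted from speech alone. The minimal NumPy sketch below is an illustrative assumption, not the paper's actual architecture: the dimensions, the linear/tanh encoders, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (illustrative, not from the paper):
# 40-dim audio features, 512-dim video-frame embeddings, 64-dim shared space.
D_AUDIO, D_VIDEO, D_EMB = 40, 512, 64

# Frozen video (teacher) encoder and trainable audio (student) encoder,
# both reduced to a single tanh(x @ W) layer for the sketch.
W_teacher = rng.normal(scale=0.1, size=(D_VIDEO, D_EMB))
W_student = rng.normal(scale=0.1, size=(D_AUDIO, D_EMB))

def distillation_step(W_s, audio, video, lr=0.1):
    """One gradient step pulling student (audio) embeddings toward the
    frozen teacher (video) embeddings under an MSE distillation loss."""
    z_s = np.tanh(audio @ W_s)        # (B, D_EMB) student embeddings
    z_t = np.tanh(video @ W_teacher)  # (B, D_EMB) teacher targets (constant)
    diff = z_s - z_t
    loss = float(np.mean(diff ** 2))
    # Backprop through tanh: dL/dW = audio^T @ (diff * (1 - z_s^2)) * 2/(B*D)
    grad = audio.T @ (diff * (1.0 - z_s ** 2)) * (2.0 / diff.size)
    return W_s - lr * grad, loss

# Toy "paired" audio/video batch standing in for synchronized clips.
audio = rng.normal(size=(8, D_AUDIO))
video = rng.normal(size=(8, D_VIDEO))

W, losses = W_student, []
for _ in range(50):
    W, loss = distillation_step(W, audio, video)
    losses.append(loss)
# The distillation loss should shrink as the student mimics the teacher.
```

In the paper's full pipeline the matched student embeddings would then condition a GAN generator producing the video frames; this sketch covers only the distillation objective itself.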
Pages: 1629-1633 (5 pages)