Speech2Video: Cross-Modal Distillation for Speech to Video Generation

Cited by: 5
Authors
Si, Shijing [1]
Wang, Jianzong [1]
Qu, Xiaoyang [1]
Cheng, Ning [1]
Wei, Wenqi [1]
Zhu, Xinghua [1]
Xiao, Jing [1]
Affiliations
[1] Ping An Technol Shenzhen Co Ltd, Shenzhen, Peoples R China
Source
Keywords
Video generation; generative adversarial network; distillation; unsupervised learning; representation learning
DOI
10.21437/Interspeech.2021-1996
Chinese Library Classification (CLC) number
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification code
100104; 100213
Abstract
This paper investigates a novel task: generating talking-face video solely from speech. Speech-to-video generation can enable applications in entertainment, customer service, and human-computer interaction. The timbre, accent, and speed of speech can carry rich information about a speaker's appearance. The main challenge lies in disentangling the distinct visual attributes from audio signals. In this article, we propose a lightweight cross-modal distillation method to extract disentangled emotional and identity information from unlabelled video inputs. The extracted features are then integrated by a generative adversarial network into talking-face video clips. With carefully crafted discriminators, the proposed framework achieves realistic generation results. Experiments with observed individuals demonstrate that the proposed framework captures emotional expressions solely from speech and produces spontaneous facial motion in the video output. The results of the proposed framework are almost indistinguishable from those of the baseline method, in which speech is combined with a static image of the speaker. User studies also show that the proposed method outperforms existing algorithms in terms of emotion expression in the generated videos.
Pages: 1629-1633
Number of pages: 5