Predicting Head Pose from Speech with a Conditional Variational Autoencoder

Cited by: 20

Authors:

Greenwood, David [1]

Laycock, Stephen [1]

Matthews, Iain [1]

Affiliations:

[1] Univ East Anglia, Sch Comp Sci, Norwich, Norfolk, England

Source:

18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION | 2017

Keywords:

speech animation; head motion synthesis; visual prosody; generative models; BLSTM; CVAE; NETWORKS;

DOI:

10.21437/Interspeech.2017-894

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Natural movement plays a significant role in realistic speech animation. Numerous studies have demonstrated the contribution visual cues make to the degree to which we, as human observers, find an animation acceptable. Rigid head motion is one visual mode that universally co-occurs with speech, and so it is a reasonable strategy to seek a transformation from the speech mode to predict the head pose. Several previous authors have shown that prediction is possible, but experiments are typically confined to rigidly produced dialogue. Natural, expressive, emotive and prosodic speech exhibits motion patterns that are far more difficult to predict, with considerable variation in expected head pose. Recently, Long Short-Term Memory (LSTM) networks have become an important tool for modelling speech and natural language tasks. We employ Deep Bi-Directional LSTMs (BLSTM), capable of learning long-term structure in language, to model the relationship that speech has with rigid head motion. We then extend our model by conditioning with prior motion. Finally, we introduce a generative head motion model, conditioned on audio features using a Conditional Variational Autoencoder (CVAE). Each approach mitigates the problems of the one-to-many mapping that a speech-to-head-pose model must accommodate.

Pages: 3991-3995

Page count: 5
