ProsAudit, a prosodic benchmark for self-supervised speech models

Cited by: 0

Authors:

de Seyssel, Maureen ^{[1,2]}

Lavechin, Marvin ^{[1,6]}

Titeux, Hadrien ^{[1]}

Thomas, Arthur ^{[7]}

Virlet, Gwendal ^{[5,7]}

Revilla, Andrea Santos ^{[7]}

Wisniewski, Guillaume ^{[2]}

Ludusan, Bogdan ^{[3,4]}

Dupoux, Emmanuel ^{[1,6]}

Affiliations:

[1] PSL Res Univ, Cognit Machine Learning, EHESS, ENS, CNRS, INRIA, Paris, France

[2] Univ Paris Cite, CNRS, Lab Linguist Formelle, Paris, France

[3] Bielefeld Univ, Fac Linguist & Literary Studies, Bielefeld, Germany

[4] Bielefeld Univ, CITEC, Bielefeld, Germany

[5] INRAE, Inst Agro, PEGASE, St Gilles, France

[6] Meta AI Res, Paris, France

[7] CoML, Paris, France

Source:

INTERSPEECH 2023 | 2023

Keywords:

prosody; speech representation; self-supervised learning; human evaluation; SPOKEN LANGUAGE;

DOI:

10.21437/Interspeech.2023-438

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

We present ProsAudit, a benchmark in English to assess structural prosodic knowledge in self-supervised learning (SSL) speech models. It consists of two subtasks, their corresponding metrics, and an evaluation dataset. In the protosyntax task, the model must correctly identify strong versus weak prosodic boundaries. In the lexical task, the model needs to correctly distinguish between pauses inserted between words and within words. We also provide human evaluation scores on this benchmark. We evaluated a series of SSL models and found that they were all able to perform above chance on both tasks, even when evaluated on an unseen language. However, non-native models performed significantly worse than native ones on the lexical task, highlighting the importance of lexical knowledge in this task. We also found a clear effect of size, with models trained on more data performing better on both subtasks.

Pages: 2963-2967

Number of pages: 5
