MULTI-SPEAKER PITCH TRACKING VIA EMBODIED SELF-SUPERVISED LEARNING

Cited by: 1
Authors
Li, Xiang [1 ]
Sun, Yifan
Wu, Xihong
Chen, Jing
Affiliations
[1] Peking Univ, Speech & Hearing Res Ctr, Dept Machine Intelligence, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Multi-pitch tracking; self-supervised learning; speech perception; speech production;
DOI
10.1109/ICASSP43922.2022.9747262
CLC Classification
O42 [Acoustics];
Subject Classification
070206 ; 082403 ;
Abstract
Pitch is a critical cue in human speech perception. Although pitch tracking for single-talker speech succeeds in many applications, extracting pitch information from speech mixtures remains challenging. Inspired by the motor theory of speech perception, this work proposes a novel multi-speaker pitch tracking approach based on an embodied self-supervised learning method (EMSSL-Pitch). The conceptual idea is that speech is produced by an underlying physical process (the human vocal tract) driven by articulatory parameters (articulatory-to-acoustic), while speech perception is the inverse process, aiming to recover the speaker's intended articulatory gestures from acoustic signals (acoustic-to-articulatory). The pitch value is part of the articulatory parameters, corresponding to the vibration frequency of the vocal folds. The acoustic-to-articulatory inversion is modeled in a self-supervised manner: an inference network is learned by iteratively sampling and training. The representations learned by this inference network have explicit physical meaning, i.e., articulatory parameters from which pitch information can be further extracted. Experiments on the GRID database show that EMSSL-Pitch achieves performance comparable to supervised baselines and generalizes to unseen speakers.
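The sample-synthesize-invert idea from the abstract can be illustrated with a minimal toy sketch. Everything here is an assumption for illustration, not the paper's actual method: `synthesize` is a hypothetical 3-harmonic stand-in for the articulatory-to-acoustic forward model, and a nearest-neighbor lookup over sampled parameters stands in for the paper's learned inference network. The self-supervised part is that training pairs come from sampling parameters and running the forward model, with no labeled speech.

```python
import numpy as np

def synthesize(f0, sr=8000, n_fft=256):
    """Toy articulatory-to-acoustic forward model (hypothetical): a
    3-harmonic 'voice' at fundamental f0, returned as a magnitude spectrum."""
    t = np.arange(n_fft) / sr
    sig = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in (1, 2, 3))
    return np.abs(np.fft.rfft(sig))

# Self-supervision by sampling: draw articulatory parameters (here only f0),
# run the forward model, and keep the (acoustics, parameters) pairs.
f0_grid = np.linspace(80.0, 300.0, 441)               # sampled pitch values (Hz)
spectra = np.stack([synthesize(f) for f in f0_grid])  # synthetic acoustics

def infer_pitch(spectrum):
    """Acoustic-to-articulatory inversion: return the sampled f0 whose
    synthetic spectrum best matches the observed one (a 1-NN stand-in
    for the paper's learned inference network)."""
    dists = np.linalg.norm(spectra - spectrum, axis=1)
    return float(f0_grid[int(np.argmin(dists))])

print(infer_pitch(synthesize(150.0)))  # -> 150.0
```

In the paper this lookup is replaced by a trained network and the forward model is a physical vocal-tract process, but the loop shape is the same: sample parameters, synthesize acoustics, fit the inverse mapping.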
Pages: 8257 - 8261
Page count: 5
Related Papers
50 records
  • [1] Audio Mixing Inversion via Embodied Self-supervised Learning
    Zhou, Haotian
    Yu, Feng
    Wu, Xihong
    [J]. MACHINE INTELLIGENCE RESEARCH, 2024, 21 (01) : 55 - 62
  • [2] Robust Multi-Speaker Tracking via Dictionary Learning and Identity Modeling
    Barnard, Mark
    Koniusz, Peter
    Wang, Wenwu
    Kittler, Josef
    Naqvi, Syed Mohsen
    Chambers, Jonathon
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2014, 16 (03) : 864 - 880
  • [3] Self-Supervised Embodied Learning for Semantic Segmentation
    Wang, Juan
    Liu, Xinzhu
    Zhao, Dawei
    Dai, Bin
    Liu, Huaping
    [J]. 2023 IEEE INTERNATIONAL CONFERENCE ON DEVELOPMENT AND LEARNING, ICDL, 2023, : 383 - 390
  • [4] Self-Supervised Learning for Online Speaker Diarization
    Chien, Jen-Tzung
    Luo, Sixun
    [J]. 2021 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2021, : 2036 - 2042
  • [5] ROBUST SPEAKER VERIFICATION WITH JOINT SELF-SUPERVISED AND SUPERVISED LEARNING
    Wang, Kai
    Zhang, Xiaolei
    Zhang, Miao
    Li, Yuguang
    Lee, Jaeyun
    Cho, Kiho
    Park, Sung-UN
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7637 - 7641
  • [6] Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking
    Li, Yidi
    Liu, Hong
    Tang, Hao
    [J]. THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 1456 - 1463
  • [7] ROBUST SELF-SUPERVISED SPEAKER REPRESENTATION LEARNING VIA INSTANCE MIX REGULARIZATION
    Kang, Woo Hyun
    Alam, Jahangir
    Fathan, Abderrahim
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6617 - 6621
  • [8] Visually Assisted Self-supervised Audio Speaker Localization and Tracking
    Zhao, Jinzheng
    Wu, Peipei
    Goudarzi, Shidrokh
    Liu, Xubo
    Sun, Jianyuan
    Xu, Yong
    Wang, Wenwu
    [J]. 2022 30TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2022), 2022, : 787 - 791
  • [9] Multi-array multi-speaker tracking
    Potamitis, I
    Tremoulis, G
    Fakotakis, N
    [J]. TEXT, SPEECH AND DIALOGUE, PROCEEDINGS, 2003, 2807 : 206 - 213