MULTI-SPEAKER PITCH TRACKING VIA EMBODIED SELF-SUPERVISED LEARNING

Cited by: 1
Authors
Li, Xiang [1 ]
Sun, Yifan
Wu, Xihong
Chen, Jing
Affiliations
[1] Peking Univ, Speech & Hearing Res Ctr, Dept Machine Intelligence, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Multi-pitch tracking; self-supervised learning; speech perception; speech production;
DOI
10.1109/ICASSP43922.2022.9747262
CLC Classification
O42 [Acoustics];
Subject Classification
070206 ; 082403 ;
Abstract
Pitch is a critical cue in human speech perception. Although pitch tracking for single-talker speech succeeds in many applications, extracting pitch information from speech mixtures remains challenging. Inspired by the motor theory of speech perception, this work proposes a novel multi-speaker pitch tracking approach based on an embodied self-supervised learning method (EMSSL-Pitch). The conceptual idea is that speech is produced by an underlying physical process (the human vocal tract) driven by articulatory parameters (articulatory-to-acoustic), while speech perception is the inverse process, aiming to recover the speaker's intended articulatory gestures from acoustic signals (acoustic-to-articulatory). The pitch value is part of the articulatory parameters, corresponding to the vibration frequency of the vocal folds. The acoustic-to-articulatory inversion is modeled in a self-supervised manner: an inference network is learned by iteratively sampling and training. The representations learned by this inference network have explicit physical meaning, i.e., articulatory parameters from which pitch information can be further extracted. Experiments on the GRID database show that EMSSL-Pitch achieves performance comparable to supervised baselines and generalizes to unseen speakers.
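The sample-synthesize-invert idea from the abstract can be illustrated with a minimal toy sketch. Everything here is an assumption for illustration, not the paper's actual method: `synthesize` is a hypothetical 3-harmonic stand-in for the articulatory-to-acoustic forward model, and a nearest-neighbor lookup over sampled parameters stands in for the paper's learned inference network. The self-supervised part is that training pairs come from sampling parameters and running the forward model, with no labeled speech.

```python
import numpy as np

def synthesize(f0, sr=8000, n_fft=256):
    """Toy articulatory-to-acoustic forward model (hypothetical): a
    3-harmonic 'voice' at fundamental f0, returned as a magnitude spectrum."""
    t = np.arange(n_fft) / sr
    sig = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in (1, 2, 3))
    return np.abs(np.fft.rfft(sig))

# Self-supervision by sampling: draw articulatory parameters (here only f0),
# run the forward model, and keep the (acoustics, parameters) pairs.
f0_grid = np.linspace(80.0, 300.0, 441)               # sampled pitch values (Hz)
spectra = np.stack([synthesize(f) for f in f0_grid])  # synthetic acoustics

def infer_pitch(spectrum):
    """Acoustic-to-articulatory inversion: return the sampled f0 whose
    synthetic spectrum best matches the observed one (a 1-NN stand-in
    for the paper's learned inference network)."""
    dists = np.linalg.norm(spectra - spectrum, axis=1)
    return float(f0_grid[int(np.argmin(dists))])

print(infer_pitch(synthesize(150.0)))  # -> 150.0
```

In the paper this lookup is replaced by a trained network and the forward model is a physical vocal-tract process, but the loop shape is the same: sample parameters, synthesize acoustics, fit the inverse mapping.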
Pages: 8257 - 8261
Page count: 5
Related Papers
50 records
  • [1] Audio Mixing Inversion via Embodied Self-supervised Learning
    Zhou, Haotian
    Yu, Feng
    Wu, Xihong
    [J]. MACHINE INTELLIGENCE RESEARCH, 2024, 21 (01) : 55 - 62
  • [2] Robust Multi-Speaker Tracking via Dictionary Learning and Identity Modeling
    Barnard, Mark
    Koniusz, Peter
    Wang, Wenwu
    Kittler, Josef
    Naqvi, Syed Mohsen
    Chambers, Jonathon
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2014, 16 (03) : 864 - 880
  • [3] Self-Supervised Embodied Learning for Semantic Segmentation
    Wang, Juan
    Liu, Xinzhu
    Zhao, Dawei
    Dai, Bin
    Liu, Huaping
    [J]. 2023 IEEE INTERNATIONAL CONFERENCE ON DEVELOPMENT AND LEARNING, ICDL, 2023, : 383 - 390
  • [4] Self-Supervised Learning for Online Speaker Diarization
    Chien, Jen-Tzung
    Luo, Sixun
    [J]. 2021 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2021, : 2036 - 2042
  • [5] ROBUST SPEAKER VERIFICATION WITH JOINT SELF-SUPERVISED AND SUPERVISED LEARNING
    Wang, Kai
    Zhang, Xiaolei
    Zhang, Miao
    Li, Yuguang
    Lee, Jaeyun
    Cho, Kiho
    Park, Sung-UN
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7637 - 7641
  • [6] Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking
    Li, Yidi
    Liu, Hong
    Tang, Hao
    [J]. THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 1456 - 1463
  • [7] ROBUST SELF-SUPERVISED SPEAKER REPRESENTATION LEARNING VIA INSTANCE MIX REGULARIZATION
    Kang, Woo Hyun
    Alam, Jahangir
    Fathan, Abderrahim
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6617 - 6621
  • [8] Visually Assisted Self-supervised Audio Speaker Localization and Tracking
    Zhao, Jinzheng
    Wu, Peipei
    Goudarzi, Shidrokh
    Liu, Xubo
    Sun, Jianyuan
    Xu, Yong
    Wang, Wenwu
    [J]. 2022 30TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2022), 2022, : 787 - 791
  • [9] Multi-array multi-speaker tracking
    Potamitis, I
    Tremoulis, G
    Fakotakis, N
    [J]. TEXT, SPEECH AND DIALOGUE, PROCEEDINGS, 2003, 2807 : 206 - 213