Automatic Voice Disorder Detection Using Self-Supervised Representations

Cited by: 7
Authors
Ribas, Dayana [1 ]
Pastor, Miguel A. [1 ]
Miguel, Antonio [1 ]
Martinez, David
Ortega, Alfonso [1 ,2 ]
Lleida, Eduardo [1 ]
Affiliations
[1] Univ Zaragoza, Aragon Inst Engn Res I3A, ViVoLab, Zaragoza 50018, Spain
[2] Lumenvox, D-81379 Munich, Germany
Funding
European Union Horizon 2020;
Keywords
Voice disorder; pathological speech; Saarbruecken voice database; advanced voice function assessment database; self-supervised; class token; transformer; deep neural networks; PATHOLOGY DETECTION; PREVALENCE;
DOI
10.1109/ACCESS.2023.3243986
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Many speech features and models, including Deep Neural Networks (DNN), are used for classification tasks between healthy and pathological speech with the Saarbruecken Voice Database (SVD). However, accuracy values of 80.71% for phrases or 82.8% for vowels /aiu/ are the highest reported for audio samples in SVD when the evaluation covers the wide range of pathologies in the database, rather than a selection of some pathologies. This paper targets this top performance in state-of-the-art Automatic Voice Disorder Detection (AVDD) systems. In the framework of a DNN-based AVDD system, we study the capability of Self-Supervised (SS) representation learning to describe discriminative cues between healthy and pathological speech. The system processes the SS temporal sequence of features with a single feed-forward layer and a Class-Token (CT) Transformer to obtain the classification between healthy and pathological speech. Furthermore, a suitable extension of the training set with out-of-domain data is evaluated to deal with the low availability of data for using DNN-based models in voice pathology detection. Experimental results using audio samples corresponding to phrases in the SVD dataset, including all available pathologies, show classification accuracy values of up to 93.36%. This means that the proposed AVDD system achieved accuracy improvements of 4.1% without the training data extension, and 15.62% after the training data extension, compared to the baseline system. Beyond the novelty of using SS representations for AVDD, obtaining accuracies over 90% in these conditions, using the whole set of pathologies in the SVD, is a milestone for voice disorder-related research. Furthermore, the study of how the amount of in-domain data in the training set relates to system performance provides guidance for the data preparation stage.
Lessons learned in this work suggest guidelines for taking advantage of DNNs to boost the performance of automatic systems for the diagnosis, treatment, and monitoring of voice pathologies.
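The abstract describes the architecture only at a high level: an SS feature sequence is projected by a single feed-forward layer, a class token is prepended, and a Transformer attends over the sequence so the class-token position summarizes the utterance for binary classification. A minimal, untrained numpy sketch of that class-token idea follows; all dimensions, the single simplified attention layer, and the random weight initializations are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class CTTransformerClassifier:
    """Sketch of class-token pooling over a self-supervised feature sequence."""
    def __init__(self, feat_dim, model_dim):
        # single feed-forward layer projecting SS features into model space
        self.proj = rng.standard_normal((feat_dim, model_dim)) * 0.02
        # learnable class token prepended to the sequence
        self.cls = rng.standard_normal((1, model_dim)) * 0.02
        # one simplified single-head self-attention layer
        self.Wq = rng.standard_normal((model_dim, model_dim)) * 0.02
        self.Wk = rng.standard_normal((model_dim, model_dim)) * 0.02
        self.Wv = rng.standard_normal((model_dim, model_dim)) * 0.02
        # binary classification head read from the class-token position
        self.w_out = rng.standard_normal(model_dim) * 0.02

    def forward(self, feats):
        x = feats @ self.proj                      # (T, D): project SS features
        x = np.vstack([self.cls, x])               # prepend class token -> (T+1, D)
        q, k, v = x @ self.Wq, x @ self.Wk, x @ self.Wv
        att = softmax(q @ k.T / np.sqrt(x.shape[1]))
        x = x + att @ v                            # residual self-attention
        # sigmoid on the class-token output: P(pathological)
        return 1.0 / (1.0 + np.exp(-(x[0] @ self.w_out)))

# e.g. 50 frames of 768-dim SS features (wav2vec-style dimensionality)
feats = rng.standard_normal((50, 768))
p = CTTransformerClassifier(768, 64).forward(feats)
```

Because the class token attends to every frame, its output after the Transformer layer acts as a learned pooling of the whole utterance, which is what makes a single scalar head on that position sufficient for the healthy/pathological decision.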
Pages: 14915-14927 (13 pages)