Depression recognition using a proposed speech chain model fusing speech production and perception features

Cited by: 16
Authors
Du, Minghao [1]
Liu, Shuang [1]
Wang, Tao [1]
Zhang, Wenquan [1]
Ke, Yufeng [1]
Chen, Long [1]
Ming, Dong [1, 2]
Affiliations
[1] Tianjin Univ, Acad Med Engn & Translat Med, Tianjin Int Joint Res Ctr Neural Engn, Tianjin, Peoples R China
[2] Tianjin Univ, Dept Biomed Engn, Lab Neural Engn & Rehabil, Coll Precis Instruments & Optoelect Engn, Tianjin, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Depression; Deep learning; Audio; Feature fusion; Auxiliary diagnosis; DISORDER; MACHINE;
DOI
10.1016/j.jad.2022.11.060
Chinese Library Classification number
R74 [Neurology and Psychiatry]
Abstract
Background: The growing number of patients with depression puts great pressure on clinical diagnosis. Audio-based diagnosis is a helpful auxiliary tool for early mass screening. However, current methods consider only speech perception features and ignore patients' vocal tract changes, which may partly explain their poor recognition performance.
Methods: This work proposes a novel machine speech chain model for depression recognition (MSCDR) that captures text-independent depressive speech representations along the path from the speaker's mouth to the listener's ear to improve recognition performance. In the proposed MSCDR, linear predictive coding (LPC) and Mel-frequency cepstral coefficient (MFCC) features are extracted to describe speech production and speech perception, respectively. A one-dimensional convolutional neural network and a long short-term memory network then sequentially capture intra- and inter-segment dynamic depressive features for classification.
Results: We tested the MSCDR on two public datasets with different languages and paradigms, namely the Distress Analysis Interview Corpus-Wizard of Oz and the Multi-modal Open Dataset for Mental-disorder Analysis. On the two datasets, the MSCDR achieved accuracies of 0.77 and 0.86 and average F1 scores of 0.75 and 0.86, respectively, which were better than those of other existing methods. This improvement indicates that speech production and perception features are complementary in carrying depressive information.
Limitations: The sample size was relatively small, which may limit clinical translation to some extent.
Conclusion: These experiments demonstrate the good generalization ability and superiority of the proposed MSCDR and suggest that vocal tract changes in patients with depression deserve attention in audio-based depression diagnosis.
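As an illustration of the speech production features named in the Methods, the following is a minimal sketch of LPC estimation for a single frame, using the autocorrelation method with the Levinson-Durbin recursion. It is not the authors' implementation: the abstract does not state the LPC order, frame length, or windowing, so the function names, the order of 2, and the synthetic test frame are all assumptions chosen for illustration. The MFCC branch and the CNN-LSTM classifier are omitted.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one speech frame (list of floats)."""
    n = len(frame)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def lpc(frame, order):
    """Estimate LPC coefficients with the Levinson-Durbin recursion.

    Returns (a, err), where a = [1, a1, ..., a_order] is the
    prediction-error filter and err is the residual energy.
    The order is an assumption; the abstract does not specify it.
    """
    r = autocorrelation(frame, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    if err == 0.0:
        # Silent frame: nothing to predict.
        return a, 0.0
    for i in range(1, order + 1):
        # Reflection coefficient for stage i.
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        # Update coefficients a[1..i] from the previous stage.
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        err *= 1.0 - k * k
    return a, err


# Synthetic frame (hypothetical data): x[n] = 0.9 * x[n-1],
# so a first-order predictor should recover a1 close to -0.9.
frame = [0.9 ** n for n in range(200)]
coeffs, residual = lpc(frame, order=2)
```

In a fused pipeline of the kind the abstract describes, per-frame LPC vectors such as `coeffs[1:]` would be combined with MFCC vectors from the same frames before being passed to the sequence model; how the authors combine them is not stated in the abstract.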
Pages: 299-308
Number of pages: 10