Depression recognition using a proposed speech chain model fusing speech production and perception features

Cited by: 16
Authors
Du, Minghao [1 ]
Liu, Shuang [1 ]
Wang, Tao [1 ]
Zhang, Wenquan [1 ]
Ke, Yufeng [1 ]
Chen, Long [1 ]
Ming, Dong [1 ,2 ]
Affiliations
[1] Tianjin Univ, Acad Med Engn & Translat Med, Tianjin Int Joint Res Ctr Neural Engn, Tianjin, Peoples R China
[2] Tianjin Univ, Dept Biomed Engn, Lab Neural Engn & Rehabil, Coll Precis Instruments & Optoelect Engn, Tianjin, Peoples R China
Funding
National Natural Science Foundation of China;
关键词
Depression; Deep learning; Audio; Feature fusion; Auxiliary diagnosis; DISORDER; MACHINE;
DOI
10.1016/j.jad.2022.11.060
Chinese Library Classification (CLC)
R74 [Neurology and Psychiatry];
Abstract
Background: The growing number of patients with depression places great pressure on clinical diagnosis. Audio-based diagnosis is a helpful auxiliary tool for early mass screening. However, current methods consider only speech perception features and ignore patients' vocal tract changes, which may partly account for their poor recognition performance.
Methods: This work proposes a novel machine speech chain model for depression recognition (MSCDR) that captures a text-independent depressive speech representation from the speaker's mouth to the listener's ear to improve recognition performance. In the proposed MSCDR, linear predictive coding (LPC) and Mel-frequency cepstral coefficient (MFCC) features are extracted to describe the processes of speech production and speech perception, respectively. A one-dimensional convolutional neural network and a long short-term memory network then sequentially capture intra- and inter-segment dynamic depressive features for classification.
Results: We tested the MSCDR on two public datasets with different languages and paradigms: the Distress Analysis Interview Corpus-Wizard of Oz and the Multi-modal Open Dataset for Mental-disorder Analysis. The accuracy of the MSCDR on the two datasets was 0.77 and 0.86, and the average F1 scores were 0.75 and 0.86, respectively, outperforming the other existing methods. This improvement reveals the complementarity of speech production and perception features in carrying depressive information.
Limitations: The sample size was relatively small, which may limit clinical translation to some extent.
Conclusion: This experiment demonstrates the good generalization ability and superiority of the proposed MSCDR, and suggests that the vocal tract changes of patients with depression deserve attention in audio-based depression diagnosis.
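As the abstract describes, the MSCDR pairs production-side LPC features (modeling the vocal tract) with perception-side MFCCs. The following is a minimal NumPy-only sketch of the LPC step, not the authors' implementation: all function names are illustrative, and the frame length, hop size, and LPC order are assumed values chosen for illustration. It frames a signal and solves the Yule-Walker (autocorrelation) equations per frame.

```python
import numpy as np

def lpc_coefficients(frame: np.ndarray, order: int = 12) -> np.ndarray:
    """Estimate LPC coefficients of one frame via the Yule-Walker equations."""
    n = len(frame)
    # Autocorrelation up to lag `order`
    r = np.array([np.dot(frame[: n - k], frame[k:]) for k in range(order + 1)])
    # Toeplitz system R a = r[1:order+1]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1 : order + 1])

def frame_signal(y: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Split a 1-D signal into overlapping frames (25 ms / 10 ms at 16 kHz, assumed)."""
    n_frames = 1 + (len(y) - frame_len) // hop
    return np.stack([y[i * hop : i * hop + frame_len] for i in range(n_frames)])

# Synthetic stand-in for a 1-second, 16 kHz speech signal: a stable AR(2) process.
rng = np.random.default_rng(0)
e = rng.standard_normal(16000)
y = np.zeros_like(e)
for t in range(2, len(e)):
    y[t] = 1.3 * y[t - 1] - 0.6 * y[t - 2] + e[t]

frames = frame_signal(y)                                        # (98, 400)
lpc = np.stack([lpc_coefficients(f, order=12) for f in frames]) # one vector per frame
print(lpc.shape)  # → (98, 12)
```

In the paper's pipeline, per-frame feature vectors like these (concatenated with MFCCs) would then feed the 1D-CNN for intra-segment patterns and the LSTM for inter-segment dynamics.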
Pages: 299-308 (10 pages)
Related papers
50 records total
  • [31] Empirical Interpretation of Speech Emotion Perception with Attention Based Model for Speech Emotion Recognition
    Jalal, Md Asif
    Milner, Rosanna
    Hain, Thomas
    INTERSPEECH 2020, 2020, : 4113 - 4117
  • [32] Fusing features of speech for depression classification based on higher-order spectral analysis
    Miao, Xiaolin
    Li, Yao
    Wen, Min
    Liu, Yongyan
    Julian, Ibegbu Nnamdi
    Guo, Hao
    SPEECH COMMUNICATION, 2022, 143 : 46 - 56
  • [33] Speech Recognition with Word Fragment Detection Using Prosody Features for Spontaneous Speech
    Yeh, Jui-Feng
    Yen, Ming-Chi
    APPLIED MATHEMATICS & INFORMATION SCIENCES, 2012, 6 (02): : 669S - 675S
  • [34] Multi-Modal Emotion Recognition by Fusing Correlation Features of Speech-Visual
    Chen Guanghui
    Zeng Xiaoping
    IEEE SIGNAL PROCESSING LETTERS, 2021, 28 : 533 - 537
  • [35] Speech Databases, Speech Features, and Classifiers in Speech Emotion Recognition: A Review
    Dar, G. H. Mohmad
    Delhibabu, Radhakrishnan
    IEEE ACCESS, 2024, 12 : 151122 - 151152
  • [36] Gender opposition recognition method fusing emojis and multi-features in Chinese speech
    Zhang, Shunxiang
    Ma, Zichen
    Li, Hanchen
    Liu, Yunduo
    Chen, Lei
    Li, Kuan-Ching
    SOFT COMPUTING, 2025, 29 (04) : 2379 - 2390
  • [37] Using visible speech to train perception and production of speech for individuals with hearing loss
    Massaro, DW
    Light, J
    JOURNAL OF SPEECH LANGUAGE AND HEARING RESEARCH, 2004, 47 (02): : 304 - 320
  • [38] Tree-based Context Clustering Using Speech Recognition Features for Acoustic Model Training of Speech Synthesis
    Chanjaradwichai, Supadaech
    Suchato, Atiwong
    Punyabukkana, Proadpran
    2015 12TH INTERNATIONAL CONFERENCE ON ELECTRICAL ENGINEERING/ELECTRONICS, COMPUTER, TELECOMMUNICATIONS AND INFORMATION TECHNOLOGY (ECTI-CON), 2015,
  • [39] Speech production knowledge in automatic speech recognition
    King, Simon
    Frankel, Joe
    Livescu, Karen
    McDermott, Erik
    Richmond, Korin
    Wester, Mirjam
    JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2007, 121 (02): : 723 - 742
  • [40] Speech production parameters for automatic speech recognition
    McGowan, RS
    Faber, A
    JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 1997, 101 (01): : 28 - 28