Spatial-Temporal Feature Network for Speech-Based Depression Recognition

Cited by: 6
Authors
Han, Zhuojin [1 ,2 ]
Shang, Yuanyuan [1 ,2 ]
Shao, Zhuhong [1 ,2 ]
Liu, Jingyi [2 ,3 ]
Guo, Guodong [4 ]
Liu, Tie [1 ,2 ]
Ding, Hui [1 ,2 ]
Hu, Qiang [5 ]
Affiliations
[1] Capital Normal Univ, Coll Informat Engn, Beijing 100048, Peoples R China
[2] Capital Normal Univ, Beijing Key Lab Elect Syst Reliabil Technol, Beijing 100048, Peoples R China
[3] Capital Normal Univ, Sch Math Sci, Beijing 100048, Peoples R China
[4] West Virginia Univ, Lane Dept Comp Sci & Elect Engn, Morgantown, WV 26506 USA
[5] Zhenjiang Mental Hlth Ctr, Dept Psychiat, Zhenjiang 212000, Jiangsu, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Convolutional neural network (CNN); deep learning; depression recognition; long short-term memory network; speech recognition;
DOI
10.1109/TCDS.2023.3273614
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Depression is a serious mental disorder that has received increasing attention from society. Because speech is easy to acquire, researchers have proposed various automatic depression recognition algorithms based on it. Feature selection and algorithm design are the main difficulties in speech-based depression recognition. In this work, we propose the spatial-temporal feature network (STFN) for depression recognition, which can capture the long-term temporal dependence of audio sequences. First, to obtain a better feature representation for depression analysis, we develop a self-supervised learning framework, called vector quantized wav2vec transformer net (VQWTNet), to map speech features and phonemes with transfer learning. Second, the stacked gated residual blocks in the spatial feature extraction network integrate causal and dilated convolutions, capturing multiscale contextual information by continuously expanding the receptive field. Third, instead of an LSTM, our method employs the hierarchical contrastive predictive coding (HCPC) loss in HCPCNet to capture the long-term temporal dependencies of speech, reducing the number of parameters and making the network easier to train. Finally, experimental results on DAIC-WOZ (Audio/Visual Emotion Challenge (AVEC) 2017) and E-DAIC (AVEC 2019) show that the proposed model significantly improves the accuracy of depression recognition. On both data sets, our method far exceeds the baseline and achieves competitive results compared with state-of-the-art methods.
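The abstract's claim that stacked causal, dilated convolutions "capture multiscale contextual information by continuously expanding the receptive field" can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function names and the doubling dilation schedule are illustrative assumptions, chosen to show why the receptive field grows exponentially with depth while each layer remains cheap.

```python
import numpy as np

def causal_dilated_conv1d(x, w, dilation):
    """Causal dilated 1-D convolution (illustrative, not the paper's code).
    Output at time t depends only on x[t], x[t-d], x[t-2d], ... so no
    future samples leak in. Left-pads with zeros to keep the length."""
    K = len(w)
    pad = (K - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([sum(w[k] * xp[t + pad - k * dilation] for k in range(K))
                     for t in range(len(x))])

def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of a stack of causal dilated convs."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Causality check: an impulse at t=3 influences only outputs at t >= 3.
x = np.zeros(8)
x[3] = 1.0
y = causal_dilated_conv1d(x, np.array([1.0, 1.0]), dilation=2)
# y is nonzero only at t=3 (w[0]*x[3]) and t=5 (w[1]*x[5-2]).

# Doubling the dilation per layer (1, 2, 4, 8) with kernel size 2 gives a
# receptive field of 16 samples from just four layers.
print(receptive_field(2, [1, 2, 4, 8]))
```

The design point the abstract relies on: with a fixed kernel size, doubling the dilation at each layer multiplies the receptive field roughly by two per layer, so long-range context is reached with far fewer parameters than a wide kernel or a deep undilated stack.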
Pages: 308-318
Number of pages: 11
Related Papers
50 records in total
  • [21] Human interaction recognition using spatial-temporal salient feature
    Tao Hu
    Xinyan Zhu
    Shaohua Wang
    Lian Duan
    [J]. Multimedia Tools and Applications, 2019, 78 : 28715 - 28735
  • [22] Spatial-Temporal Convolutional Attention Network for Action Recognition
    Luo, Huilan
    Chen, Han
    [J]. Computer Engineering and Applications, 2023, 59 (09) : 150 - 158
  • [23] Spatial-Temporal Recurrent Neural Network for Emotion Recognition
    Zhang, Tong
    Zheng, Wenming
    Cui, Zhen
    Zong, Yuan
    Li, Yang
    [J]. IEEE TRANSACTIONS ON CYBERNETICS, 2019, 49 (03) : 839 - 847
  • [24] Spatial-Temporal Interleaved Network for Efficient Action Recognition
    Jiang, Shengqin
    Zhang, Haokui
    Qi, Yuankai
    Liu, Qingshan
    [J]. IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, 2024,
  • [25] Contemporary Stochastic Feature Selection Algorithms for Speech-based Emotion Recognition
    Sidorov, Maxim
    Brester, Christina
    Schmitt, Alexander
    [J]. 16TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2015), VOLS 1-5, 2015, : 2699 - 2703
  • [26] Temporal Confusion Network for Speech-based Soccer Event Retrieval
    Pham, Nhut M.
    Vu, Quan H.
    [J]. 2013 INTERNATIONAL CONFERENCE ON ADVANCED TECHNOLOGIES FOR COMMUNICATIONS (ATC), 2013, : 549 - 553
  • [27] Human action recognition based on multi-mode spatial-temporal feature fusion
    Wang, Dongli
    Yang, Jun
    Zhou, Yan
    [J]. 2019 22ND INTERNATIONAL CONFERENCE ON INFORMATION FUSION (FUSION 2019), 2019,
  • [28] Action Recognition in Video using a Spatial-Temporal Graph-based Feature Representation
    Jargalsaikhan, Iveel
    Little, Suzanne
    Trichet, Remi
    O'Connor, Noel E.
    [J]. 2015 12TH IEEE INTERNATIONAL CONFERENCE ON ADVANCED VIDEO AND SIGNAL BASED SURVEILLANCE (AVSS), 2015,
  • [29] Investigation of speech-based language-independent possibilities of depression recognition
    Kiss, Gabor
    [J]. 2022 45TH INTERNATIONAL CONFERENCE ON TELECOMMUNICATIONS AND SIGNAL PROCESSING, TSP, 2022, : 226 - 229
  • [30] Continuous Sign Language Recognition Based on Spatial-Temporal Graph Attention Network
    Guo, Qi
    Zhang, Shujun
    Li, Hui
    [J]. CMES-COMPUTER MODELING IN ENGINEERING & SCIENCES, 2023, 134 (03): : 1653 - 1670