Spatial-Temporal Feature Network for Speech-Based Depression Recognition

Cited by: 6
Authors
Han, Zhuojin [1 ,2 ]
Shang, Yuanyuan [1 ,2 ]
Shao, Zhuhong [1 ,2 ]
Liu, Jingyi [2 ,3 ]
Guo, Guodong [4 ]
Liu, Tie [1 ,2 ]
Ding, Hui [1 ,2 ]
Hu, Qiang [5 ]
Affiliations
[1] Capital Normal Univ, Coll Informat Engn, Beijing 100048, Peoples R China
[2] Capital Normal Univ, Beijing Key Lab Elect Syst Reliabil Technol, Beijing 100048, Peoples R China
[3] Capital Normal Univ, Sch Math Sci, Beijing 100048, Peoples R China
[4] West Virginia Univ, Lane Dept Comp Sci & Elect Engn, Morgantown, WV 26506 USA
[5] Zhenjiang Mental Hlth Ctr, Dept Psychiat, Zhenjiang 212000, Jiangsu, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Convolutional neural network (CNN); deep learning; depression recognition; long short-term memory network; speech recognition;
DOI
10.1109/TCDS.2023.3273614
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Depression is a serious mental disorder that has received increasing attention from society. Because speech is easy to acquire, researchers have proposed a variety of automatic, speech-based depression recognition algorithms. Feature selection and algorithm design are the main difficulties in speech-based depression recognition. In our work, we propose the spatial-temporal feature network (STFN) for depression recognition, which can capture the long-term temporal dependence of audio sequences. First, to obtain a better feature representation for depression analysis, we develop a self-supervised learning framework, called the vector quantized wav2vec transformer net (VQWTNet), to map speech features and phonemes with transfer learning. Second, the spatial feature extraction network stacks gated residual blocks that integrate causal and dilated convolutions, capturing multiscale contextual information by continuously expanding the receptive field. In addition, instead of an LSTM, our method employs the hierarchical contrastive predictive coding (HCPC) loss in HCPCNet to capture the long-term temporal dependencies of speech, reducing the number of parameters while making the network easier to train. Finally, experimental results on DAIC-WOZ (Audio/Visual Emotion Challenge (AVEC) 2017) and E-DAIC (AVEC 2019) show that the proposed model significantly improves the accuracy of depression recognition. On both data sets, our method far exceeds the baseline and achieves competitive results compared with state-of-the-art methods.
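Two mechanisms named in the abstract can be illustrated in a few lines: stacked dilated causal convolutions grow the receptive field with depth, and contrastive predictive coding scores a predicted representation against one positive and several negatives (InfoNCE). The NumPy sketch below is illustrative only, not the paper's implementation; the function names, kernel size, and dilation schedule are assumptions.

```python
import numpy as np

def causal_dilated_conv1d(x, w, dilation):
    """Causal dilated 1-D convolution: the output at time t depends only on
    x[t], x[t - dilation], x[t - 2*dilation], ... (zero-padded on the left)."""
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([sum(w[j] * xp[t + pad - j * dilation] for j in range(k))
                     for t in range(len(x))])

def receptive_field(kernel_size, dilations):
    """Receptive field (in time steps) of a stack of dilated causal convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

def info_nce(z_pred, z_pos, z_negs):
    """InfoNCE loss used by contrastive predictive coding: the prediction
    should score higher with the true future (z_pos) than with negatives."""
    scores = np.array([z_pred @ z_pos] + [z_pred @ n for n in z_negs])
    scores -= scores.max()  # numerical stability
    return -np.log(np.exp(scores[0]) / np.exp(scores).sum())

# Four layers with kernel size 2 and doubling dilations already cover
# 16 time steps of context:
print(receptive_field(2, [1, 2, 4, 8]))  # -> 16
```

The exponential growth of the receptive field is what lets a fully convolutional stack capture multiscale context without recurrence, while a contrastive loss of this kind replaces the LSTM's sequential state with a prediction objective.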
Pages: 308 - 318
Page count: 11
Related Papers
50 records
  • [1] Robust speech recognition using spatial-temporal feature distribution characteristics
    Chen, Berlin
    Chen, Wei-Hau
    Lin, Shih-Hsiang
    Chu, Wen-Yi
    [J]. PATTERN RECOGNITION LETTERS, 2011, 32 (07) : 919 - 926
  • [2] Exploiting Spatial-Temporal Feature Distribution Characteristics for Robust Speech Recognition
    Chen, Wei-Hau
    Lin, Shih-Hsiang
    Chen, Berlin
    [J]. INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, 2008, : 2004 - 2007
  • [3] Feature-level fusion based on spatial-temporal of pervasive EEG for depression recognition
    Zhang, Bingtao
    Wei, Dan
    Yan, Guanghui
    Lei, Tao
    Cai, Haishu
    Yang, Zhifei
    [J]. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE, 2022, 226
  • [4] Spatial-Temporal Feature Fusion Neural Network for EEG-Based Emotion Recognition
    Wang, Zhe
    Wang, Yongxiong
    Zhang, Jiapeng
    Hu, Chuanfei
    Yin, Zhong
    Song, Yu
    [J]. IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, 2022, 71
  • [6] Spatial-Temporal Attention Network for Depression Recognition from facial videos
    Pan, Yuchen
    Shang, Yuanyuan
    Liu, Tie
    Shao, Zhuhong
    Guo, Guodong
    Ding, Hui
    Hu, Qiang
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2024, 237
  • [7] Action Recognition by Joint Spatial-Temporal Motion Feature
    Zhang, Weihua
    Zhang, Yi
    Gao, Chaobang
    Zhou, Jiliu
    [J]. JOURNAL OF APPLIED MATHEMATICS, 2013,
  • [8] EEG-based mild depression recognition using multi-kernel convolutional and spatial-temporal Feature
    Fan, Yongheng
    Yu, Ruilan
    Li, Jianxiu
    Zhu, Jing
    Li, Xiaowei
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE, 2020, : 1777 - 1784
  • [9] Speech emotion recognition via multiple fusion under spatial-temporal parallel network
    Gan, Chenquan
    Wang, Kexin
    Zhu, Qingyi
    Xiang, Yong
    Jain, Deepak Kumar
    Garcia, Salvador
    [J]. NEUROCOMPUTING, 2023, 555
  • [10] Feature selection enhancement and feature space visualization for speech-based emotion recognition
    Kanwal, Sofia
    Asghar, Sohail
    Ali, Hazrat
    [J]. PEERJ COMPUTER SCIENCE, 2022, 8