A Multimodal Approach for Detection and Assessment of Depression Using Text, Audio and Video

Cited by: 0
Authors
Zhang, Wei [1 ]
Mao, Kaining [1 ]
Chen, Jie [1 ,2 ]
Affiliations
[1] Univ Alberta, Dept Elect & Comp Engn, Edmonton, AB T6G 2R3, Canada
[2] Fudan Univ, Acad Engn & Technol, Shanghai 200433, Peoples R China
Source
PHENOMICS | 2024, Vol. 4, Issue 3
Keywords
Automatic depression detection; Natural language processing; Machine learning; Deep learning; SUICIDE;
DOI
10.1007/s43657-023-00152-8
Chinese Library Classification
Q3 [Genetics]
Subject Classification Codes
071007; 090102
Abstract
Depression is one of the most common mental disorders, and its prevalence increases each year. Traditional diagnostic methods rely primarily on professional judgment, which is prone to individual bias, so an effective and robust method for automated depression detection is needed. Current artificial intelligence approaches are limited in their ability to extract features from long sentences, and existing models are not robust to high-dimensional inputs. To address these concerns, a multimodal fusion model combining text, audio, and video was developed for both depression detection and depression assessment. In the text modality, pre-trained sentence embeddings were used to extract semantic representations, and a bidirectional long short-term memory (BiLSTM) network predicted depression. In the audio modality, principal component analysis (PCA) reduced the dimensionality of the input feature space and a support vector machine (SVM) predicted depression. In the video modality, extreme gradient boosting (XGBoost) performed both feature selection and depression detection. The final predictions were obtained by combining the outputs of the three modalities with an ensemble voting algorithm. Experiments on the Distress Analysis Interview Corpus Wizard-of-Oz (DAIC-WOZ) dataset showed a marked improvement in performance, with a weighted F1 score of 0.85, a root mean square error (RMSE) of 5.57, and a mean absolute error (MAE) of 4.48. The proposed model outperforms the baseline in both depression detection and assessment tasks, and performs better than other state-of-the-art depression detection methods.
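The abstract outlines a per-modality pipeline (sentence embeddings with a BiLSTM for text, PCA followed by an SVM for audio, XGBoost for video) whose outputs are fused by ensemble voting. The sketch below illustrates that structure on synthetic data under stated assumptions: the feature dimensions, the random inputs, and the logistic-regression stand-in for the BiLSTM text branch are illustrative choices, not the authors' implementation.

```python
# Minimal sketch of modality-level classifiers plus majority-vote fusion,
# as described in the abstract. All data and dimensions are synthetic and
# hypothetical; a logistic regression stands in for the BiLSTM text branch.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n = 200                                      # hypothetical number of interviews
y = rng.integers(0, 2, n)                    # 1 = depressed, 0 = not depressed

# Illustrative per-interview feature matrices for each modality.
X_text = rng.normal(size=(n, 384))           # e.g. pooled sentence embeddings
X_audio = rng.normal(size=(n, 1000))         # high-dimensional acoustic features
X_video = rng.normal(size=(n, 150))          # facial / pose descriptors

# Text branch: sentence embeddings -> classifier (a BiLSTM in the paper).
text_clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Audio branch: PCA for dimensionality reduction, then an SVM.
audio_clf = make_pipeline(StandardScaler(), PCA(n_components=50), SVC())

# Video branch: XGBoost, which also acts as implicit feature selection.
video_clf = XGBClassifier(n_estimators=200, max_depth=3, eval_metric="logloss")

# Fit each modality independently, then fuse predictions by majority vote.
preds = []
for clf, X in [(text_clf, X_text), (audio_clf, X_audio), (video_clf, X_video)]:
    clf.fit(X, y)
    preds.append(clf.predict(X))

votes = np.stack(preds)                        # shape (3, n): one row per modality
fused = (votes.sum(axis=0) >= 2).astype(int)   # majority vote across 3 modalities
print("fused depression predictions:", fused[:10])
```

In the paper the fusion is applied to held-out DAIC-WOZ interviews rather than training data; the sketch only shows how independently trained modality classifiers can be combined with a simple voting rule.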
Pages: 234-249
Page count: 16
Related Papers
50 records in total
  • [31] Multimodal Speech Emotion Recognition using Cross Attention with Aligned Audio and Text
    Lee, Yoonhyung
    Yoon, Seunghyun
    Jung, Kyomin
    INTERSPEECH 2020, 2020, : 2717 - 2721
  • [32] Learning Relationships between Text, Audio, and Video via Deep Canonical Correlation for Multimodal Language Analysis
    Sun, Zhongkai
    Sarma, Prathusha K.
    Sethares, William A.
    Liang, Yingyu
    THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 8992 - 8999
  • [33] Bridging Text and Video: A Universal Multimodal Transformer for Audio-Visual Scene-Aware Dialog
    Li, Zekang
    Li, Zongjia
    Zhang, Jinchao
    Feng, Yang
    Zhou, Jie
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29 : 2476 - 2483
  • [34] Multi-modal depression detection based on emotional audio and evaluation text
    Ye, Jiayu
    Yu, Yanhong
    Wang, Qingxiang
    Li, Wentao
    Liang, Hu
    Zheng, Yunshao
    Fu, Gang
    JOURNAL OF AFFECTIVE DISORDERS, 2021, 295 : 904 - 913
  • [35] Improved Multimodal Sentiment Detection Using Stressed Regions of Audio
    Abburi, Harika
    Shrivastava, Manish
    Gangashetty, Suryakanth V.
    PROCEEDINGS OF THE 2016 IEEE REGION 10 CONFERENCE (TENCON), 2016, : 2834 - 2837
  • [36] Multimodal approach by embedding text and graphs for the detection of abusive messages
    Cecillon, Noe
    Dufour, Richard
    Labatut, Vincent
    TRAITEMENT AUTOMATIQUE DES LANGUES, 2021, 62 (02): : 13 - 38
  • [37] A Robust Approach for Scene Text Detection and Tracking in Video
    Wang, Yang
    Wang, Lan
    Su, Feng
    ADVANCES IN MULTIMEDIA INFORMATION PROCESSING, PT III, 2018, 11166 : 303 - 314
  • [38] An Adaptive Text Detection Approach in Images and Video Frames
    Li, Minhua
    Wang, Chunheng
    2008 IEEE INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, VOLS 1-8, 2008, : 72 - 77
  • [39] An Automatic Video Text Detection, Localization and Extraction Approach
    Zhu, Chengjun
    Ouyang, Yuanxin
    Gao, Lei
    Chen, Zhenyong
    Xiong, Zhang
    ADVANCED INTERNET BASED SYSTEMS AND APPLICATIONS, 2009, 4879 : 1 - 9
  • [40] Using Multimodal Contrastive Knowledge Distillation for Video-Text Retrieval
    Ma, Wentao
    Chen, Qingchao
    Zhou, Tongqing
    Zhao, Shan
    Cai, Zhiping
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (10) : 5486 - 5497