A Multimodal Approach for Detection and Assessment of Depression Using Text, Audio and Video

Cited by: 0
Authors
Zhang, Wei [1 ]
Mao, Kaining [1 ]
Chen, Jie [1 ,2 ]
Affiliations
[1] Univ Alberta, Dept Elect & Comp Engn, Edmonton, AB T6G 2R3, Canada
[2] Fudan Univ, Acad Engn & Technol, Shanghai 200433, Peoples R China
Source
PHENOMICS | 2024, Vol. 4, Issue 3
Keywords
Automatic depression detection; Natural language processing; Machine learning; Deep learning; SUICIDE;
DOI
10.1007/s43657-023-00152-8
Chinese Library Classification
Q3 [Genetics]
Subject Classification Codes
071007; 090102
Abstract
Depression is one of the most common mental disorders, and its prevalence increases each year. Traditional diagnostic methods rely primarily on professional judgment, which is prone to individual bias, so an effective and robust method for automated depression detection is needed. Current artificial intelligence approaches are limited in their ability to extract features from long sentences, and existing models are not robust to high-dimensional inputs. To address these concerns, a multimodal fusion model combining text, audio, and video was developed for both depression detection and depression assessment. In the text modality, pre-trained sentence embeddings were used to extract semantic representations, and a bidirectional long short-term memory (BiLSTM) network predicted depression. In the audio modality, principal component analysis (PCA) reduced the dimensionality of the input feature space and a support vector machine (SVM) predicted depression. In the video modality, extreme gradient boosting (XGBoost) performed both feature selection and depression detection. The final predictions were obtained by combining the outputs of the three modalities with an ensemble voting algorithm. Experiments on the Distress Analysis Interview Corpus Wizard-of-Oz (DAIC-WOZ) dataset showed a marked improvement in performance, with a weighted F1 score of 0.85, a root mean square error (RMSE) of 5.57, and a mean absolute error (MAE) of 4.48. The proposed model outperforms the baseline in both depression detection and assessment tasks, and performs better than other state-of-the-art depression detection methods.
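The abstract outlines a per-modality pipeline (sentence embeddings with a BiLSTM for text, PCA followed by an SVM for audio, XGBoost for video) whose outputs are fused by ensemble voting. The sketch below illustrates that structure on synthetic data under stated assumptions: the feature dimensions, the random inputs, and the logistic-regression stand-in for the BiLSTM text branch are illustrative choices, not the authors' implementation.

```python
# Minimal sketch of modality-level classifiers plus majority-vote fusion,
# as described in the abstract. All data and dimensions are synthetic and
# hypothetical; a logistic regression stands in for the BiLSTM text branch.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n = 200                                      # hypothetical number of interviews
y = rng.integers(0, 2, n)                    # 1 = depressed, 0 = not depressed

# Illustrative per-interview feature matrices for each modality.
X_text = rng.normal(size=(n, 384))           # e.g. pooled sentence embeddings
X_audio = rng.normal(size=(n, 1000))         # high-dimensional acoustic features
X_video = rng.normal(size=(n, 150))          # facial / pose descriptors

# Text branch: sentence embeddings -> classifier (a BiLSTM in the paper).
text_clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Audio branch: PCA for dimensionality reduction, then an SVM.
audio_clf = make_pipeline(StandardScaler(), PCA(n_components=50), SVC())

# Video branch: XGBoost, which also acts as implicit feature selection.
video_clf = XGBClassifier(n_estimators=200, max_depth=3, eval_metric="logloss")

# Fit each modality independently, then fuse predictions by majority vote.
preds = []
for clf, X in [(text_clf, X_text), (audio_clf, X_audio), (video_clf, X_video)]:
    clf.fit(X, y)
    preds.append(clf.predict(X))

votes = np.stack(preds)                        # shape (3, n): one row per modality
fused = (votes.sum(axis=0) >= 2).astype(int)   # majority vote across 3 modalities
print("fused depression predictions:", fused[:10])
```

In the paper the fusion is applied to held-out DAIC-WOZ interviews rather than training data; the sketch only shows how independently trained modality classifiers can be combined with a simple voting rule.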
Pages: 234-249
Page count: 16
Related Papers
50 records in total
  • [31] Multimodal Speech Emotion Recognition using Cross Attention with Aligned Audio and Text
    Lee, Yoonhyung
    Yoon, Seunghyun
    Jung, Kyomin
    INTERSPEECH 2020, 2020, : 2717 - 2721
  • [32] Learning Relationships between Text, Audio, and Video via Deep Canonical Correlation for Multimodal Language Analysis
    Sun, Zhongkai
    Sarma, Prathusha K.
    Sethares, William A.
    Liang, Yingyu
    THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 8992 - 8999
  • [33] Bridging Text and Video: A Universal Multimodal Transformer for Audio-Visual Scene-Aware Dialog
    Li, Zekang
    Li, Zongjia
    Zhang, Jinchao
    Feng, Yang
    Zhou, Jie
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29 : 2476 - 2483
  • [34] Multi-modal depression detection based on emotional audio and evaluation text
    Ye, Jiayu
    Yu, Yanhong
    Wang, Qingxiang
    Li, Wentao
    Liang, Hu
    Zheng, Yunshao
    Fu, Gang
    JOURNAL OF AFFECTIVE DISORDERS, 2021, 295 : 904 - 913
  • [35] Improved Multimodal Sentiment Detection Using Stressed Regions of Audio
    Abburi, Harika
    Shrivastava, Manish
    Gangashetty, Suryakanth V.
    PROCEEDINGS OF THE 2016 IEEE REGION 10 CONFERENCE (TENCON), 2016, : 2834 - 2837
  • [36] Multimodal approach by embedding text and graphs for the detection of abusive messages
    Cecillon, Noe
    Dufour, Richard
    Labatut, Vincent
    TRAITEMENT AUTOMATIQUE DES LANGUES, 2021, 62 (02): : 13 - 38
  • [37] A Robust Approach for Scene Text Detection and Tracking in Video
    Wang, Yang
    Wang, Lan
    Su, Feng
    ADVANCES IN MULTIMEDIA INFORMATION PROCESSING, PT III, 2018, 11166 : 303 - 314
  • [38] An Adaptive Text Detection Approach in Images and Video Frames
    Li, Minhua
    Wang, Chunheng
    2008 IEEE INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, VOLS 1-8, 2008, : 72 - 77
  • [39] An Automatic Video Text Detection, Localization and Extraction Approach
    Zhu, Chengjun
    Ouyang, Yuanxin
    Gao, Lei
    Chen, Zhenyong
    Xiong, Zhang
    ADVANCED INTERNET BASED SYSTEMS AND APPLICATIONS, 2009, 4879 : 1 - 9
  • [40] Using Multimodal Contrastive Knowledge Distillation for Video-Text Retrieval
    Ma, Wentao
    Chen, Qingchao
    Zhou, Tongqing
    Zhao, Shan
    Cai, Zhiping
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (10) : 5486 - 5497