Attention-Based Fusion of Ultrashort Voice Utterances and Depth Videos for Multimodal Person Identification

Cited by: 1

Authors:

Moufidi, Abderrazzaq [1,2]

Rousseau, David [2]

Rasti, Pejman [1,2]

Affiliations:

[1] ESAIP, Ctr Etud & Rech Aide Decis CERADE, 18 Rue 8 Mai 1945, F-49124 St Barthelemy Anjou, France

[2] Univ Angers, Lab Angevin Rech Ingn Syst LARIS, UMR INRAe IRHS, 62 Ave Notre Dame Lac, F-49000 Angers, France

Source:

SENSORS | 2023, Vol. 23, Issue 13

Keywords:

depth images; lip identification; speaker identification; late fusion; multimodality; spatiotemporal

DOI:

10.3390/s23135890

Chinese Library Classification (CLC):

O65 [Analytical Chemistry]

Discipline classification codes:

070302; 081704

Abstract:

Multimodal deep learning for biometrics faces significant challenges because it typically depends on long speech utterances and RGB images, which are often impractical to obtain. This paper addresses these issues by using ultrashort voice utterances and depth videos of the lips for person identification. The proposed method uses residual neural networks to encode depth videos and a Time Delay Neural Network (TDNN) architecture to encode voice signals. To fuse information from the two modalities, we integrate self-attention and design a model that is robust to diverse types of noise. In tests on a benchmark dataset, our approach outperforms existing methods, with an average improvement of 10%. The method is well suited to scenarios where extended utterances and RGB images are infeasible or unavailable. Its potential also extends to multimodal applications beyond person identification.
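The abstract names self-attention as the fusion mechanism but gives no architectural details, so the following is only a minimal sketch of the general idea, not the authors' implementation. It treats one voice embedding and one depth-video embedding as two tokens, applies scaled dot-product self-attention across them, and averages the attended tokens into a single fused vector. The function names, the identity query/key/value projections, the mean pooling, and the toy 4-dimensional embeddings are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def self_attention_fuse(voice_emb, depth_emb):
    """Fuse two modality embeddings with scaled dot-product self-attention.

    Assumption: queries, keys, and values are the raw embeddings
    (identity projections); a trained model would learn these projections.
    Returns the fused vector and the attention weights of each token.
    """
    tokens = [voice_emb, depth_emb]
    dim = len(voice_emb)
    attended, weights = [], []
    for query in tokens:
        # Similarity of this token to every token, scaled by sqrt(dim).
        scores = [dot(query, key) / math.sqrt(dim) for key in tokens]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors (here: the tokens themselves).
        attended.append(
            [sum(w[j] * tokens[j][i] for j in range(len(tokens))) for i in range(dim)]
        )
    # Assumption: mean pooling over the two attended tokens.
    fused = [sum(tok[i] for tok in attended) / len(attended) for i in range(dim)]
    return fused, weights


# Hypothetical toy embeddings standing in for TDNN and ResNet encoder outputs.
voice = [1.0, 0.0, 0.0, 0.0]
depth = [0.0, 1.0, 0.0, 0.0]
fused, weights = self_attention_fuse(voice, depth)
```

In a real system the fused vector would feed a classifier over enrolled identities, and the two input embeddings would come from the trained voice and depth-video encoders rather than hand-written vectors.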


Pages: 13
