Attention-Based Fusion of Ultrashort Voice Utterances and Depth Videos for Multimodal Person Identification

Cited by: 1

Authors:

Moufidi, Abderrazzaq [1,2]

Rousseau, David [2]

Rasti, Pejman [1,2]

Affiliations:

[1] ESAIP, Ctr Etud & Rech Aide Decis CERADE, 18 Rue 8 Mai 1945, F-49124 St Barthelemy Anjou, France

[2] Univ Angers, Lab Angevin Rech Ingn Syst LARIS, UMR INRAe IRHS, 62 Ave Notre Dame Lac, F-49000 Angers, France

Source:

SENSORS | 2023 / Vol. 23 / Issue 13

Keywords:

depth images; lip identification; speaker identification; late fusion; multimodality; spatiotemporal

DOI:

10.3390/s23135890

Chinese Library Classification (CLC) number:

O65 [Analytical Chemistry]

Discipline classification codes:

070302; 081704

Abstract:

Multimodal deep learning, in the context of biometrics, encounters significant challenges due to the dependence on long speech utterances and RGB images, which are often impractical in certain situations. This paper presents a novel solution addressing these issues by leveraging ultrashort voice utterances and depth videos of the lip for person identification. The proposed method utilizes an amalgamation of residual neural networks to encode depth videos and a Time Delay Neural Network architecture to encode voice signals. In an effort to fuse information from these different modalities, we integrate self-attention and engineer a noise-resistant model that effectively manages diverse types of noise. Through rigorous testing on a benchmark dataset, our approach exhibits superior performance over existing methods, resulting in an average improvement of 10%. This method is notably efficient for scenarios where extended utterances and RGB images are unfeasible or unattainable. Furthermore, its potential extends to various multimodal applications beyond just person identification.
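
As a rough illustration of the fusion idea the abstract mentions, the sketch below applies single-head scaled dot-product self-attention to two embedding vectors, one per modality, and averages the attended outputs into a single fused vector. It is not the authors' implementation: the function name `self_attention_fuse`, the embedding size of 8, the random projection matrices, and the averaging step are all assumptions made for illustration, and the random vectors stand in for the outputs of a voice encoder and a depth-video encoder.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned projection weights."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def self_attention_fuse(voice_emb, lip_emb, Wq, Wk, Wv):
    """Treat the two modality embeddings as a 2-token sequence,
    run single-head self-attention, and average the outputs."""
    tokens = [voice_emb, lip_emb]
    Q = [matvec(Wq, t) for t in tokens]
    K = [matvec(Wk, t) for t in tokens]
    V = [matvec(Wv, t) for t in tokens]
    d = len(Q[0])

    weights, attended = [], []
    for q in Q:
        # Scaled dot-product score of this query against every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        attended.append([sum(w[j] * V[j][i] for j in range(len(V))) for i in range(d)])

    # Assumption: fuse by averaging the two attended tokens.
    fused = [sum(col) / len(attended) for col in zip(*attended)]
    return fused, weights


rng = random.Random(0)
dim = 8  # illustrative embedding size, not taken from the paper
voice_emb = [rng.gauss(0, 1) for _ in range(dim)]  # placeholder voice embedding
lip_emb = [rng.gauss(0, 1) for _ in range(dim)]    # placeholder depth-video embedding
Wq, Wk, Wv = (random_matrix(dim, dim, rng) for _ in range(3))

fused, weights = self_attention_fuse(voice_emb, lip_emb, Wq, Wk, Wv)
print(len(fused), weights)
```

Each row of `weights` shows how much one modality's query attends to the voice and lip tokens; in a trained model these weights would let the fusion lean on the cleaner modality when the other is noisy.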

Pages: 13
