Attention-Based Fusion of Ultrashort Voice Utterances and Depth Videos for Multimodal Person Identification

Cited by: 1
Authors
Moufidi, Abderrazzaq [1 ,2 ]
Rousseau, David [2 ]
Rasti, Pejman [1 ,2 ]
Affiliations
[1] ESAIP, Ctr Etud & Rech Aide Decis CERADE, 18 Rue 8 Mai 1945, F-49124 St Barthelemy Anjou, France
[2] Univ Angers, Lab Angevin Rech Ingn Syst LARIS, UMR INRAe IRHS, 62 Ave Notre Dame Lac, F-49000 Angers, France
Keywords
depth images; lip identification; speaker identification; late fusion; multimodality; spatiotemporal
DOI
10.3390/s23135890
CLC Classification
O65 [Analytical Chemistry]
Subject Classification Codes
070302; 081704
Abstract
Multimodal deep learning for biometrics faces significant challenges because it typically depends on long speech utterances and RGB images, which are often impractical to acquire. This paper presents a novel solution that addresses these issues by leveraging ultrashort voice utterances and depth videos of the lip region for person identification. The proposed method uses a combination of residual neural networks to encode the depth videos and a Time Delay Neural Network (TDNN) architecture to encode the voice signals. To fuse information from these different modalities, we integrate self-attention and engineer a noise-resistant model that handles diverse types of noise. In rigorous testing on a benchmark dataset, our approach outperforms existing methods by an average of 10%. The method is notably efficient in scenarios where extended utterances and RGB images are infeasible or unavailable, and its potential extends to multimodal applications beyond person identification.
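The pipeline the abstract describes (residual-network encoding of depth lip videos, TDNN encoding of short utterances, self-attention fusion of the two embeddings) can be sketched as follows. This is a minimal illustrative sketch in PyTorch, not the authors' published implementation: the layer sizes, the MFCC front end, the two-token attention scheme, and the class names (TDNNEncoder, DepthLipEncoder, AttentionFusion) are all assumptions for illustration.

```python
# Minimal sketch (NOT the authors' exact architecture): fusing a TDNN voice
# embedding with a ResNet-style depth-video embedding via self-attention.
# All layer sizes and the two-token fusion scheme are illustrative assumptions.
import torch
import torch.nn as nn

class TDNNEncoder(nn.Module):
    """Toy TDNN: dilated 1-D convolutions over MFCC frames -> one embedding."""
    def __init__(self, n_mfcc=40, emb_dim=256):
        super().__init__()
        self.tdnn = nn.Sequential(
            nn.Conv1d(n_mfcc, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, emb_dim, kernel_size=3, dilation=3), nn.ReLU(),
        )
    def forward(self, x):                 # x: (batch, n_mfcc, frames)
        h = self.tdnn(x)                  # (batch, emb_dim, frames')
        return h.mean(dim=2)              # mean pooling over time

class DepthLipEncoder(nn.Module):
    """Toy spatiotemporal encoder: 3-D convs over depth lip clips -> embedding."""
    def __init__(self, emb_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),      # pool time and space away
        )
        self.proj = nn.Linear(32, emb_dim)
    def forward(self, x):                 # x: (batch, 1, frames, H, W)
        return self.proj(self.net(x).flatten(1))

class AttentionFusion(nn.Module):
    """Self-attention over the two modality embeddings, then classification."""
    def __init__(self, emb_dim=256, n_speakers=100):
        super().__init__()
        self.attn = nn.MultiheadAttention(emb_dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(emb_dim, n_speakers)
    def forward(self, voice_emb, depth_emb):
        tokens = torch.stack([voice_emb, depth_emb], dim=1)  # (batch, 2, emb)
        fused, _ = self.attn(tokens, tokens, tokens)         # attention across modalities
        return self.head(fused.mean(dim=1))                  # speaker logits

# Usage on random tensors standing in for an ultrashort utterance (50 MFCC
# frames) and a short depth lip clip (12 frames of 64x64 depth maps).
voice = TDNNEncoder()(torch.randn(8, 40, 50))
depth = DepthLipEncoder()(torch.randn(8, 1, 12, 64, 64))
logits = AttentionFusion()(voice, depth)
print(logits.shape)                       # torch.Size([8, 100])
```

Fusing at the embedding level, consistent with the "late fusion" keyword, lets each encoder be trained or pretrained on its own modality before the attention head learns to weight the two embeddings jointly; whether the paper trains end to end or in stages is not stated in this record.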
Pages: 13