Attention-Based Fusion of Ultrashort Voice Utterances and Depth Videos for Multimodal Person Identification

Cited by: 1
Authors
Moufidi, Abderrazzaq [1 ,2 ]
Rousseau, David [2 ]
Rasti, Pejman [1 ,2 ]
Affiliations
[1] ESAIP, Ctr Etud & Rech Aide Decis CERADE, 18 Rue 8 Mai 1945, F-49124 St Barthelemy Anjou, France
[2] Univ Angers, Lab Angevin Rech Ingn Syst LARIS, UMR INRAe IRHS, 62 Ave Notre Dame Lac, F-49000 Angers, France
Keywords
depth images; lip identification; speaker identification; late fusion; multimodality; spatiotemporal
DOI
10.3390/s23135890
Chinese Library Classification
O65 [Analytical Chemistry]
Subject Classification Codes
070302; 081704
Abstract
Multimodal deep learning for biometrics faces a significant practical obstacle: it typically depends on long speech utterances and RGB images, which are often impractical to acquire. This paper presents a solution that instead leverages ultrashort voice utterances and depth videos of the lip for person identification. The proposed method uses residual neural networks to encode the depth videos and a Time Delay Neural Network (TDNN) architecture to encode the voice signals. To fuse information from these two modalities, we integrate self-attention and engineer a model that remains robust under diverse types of noise. In rigorous testing on a benchmark dataset, our approach outperforms existing methods by an average of 10%. The method is particularly suited to scenarios where extended utterances and RGB images are infeasible or unavailable, and its potential extends to multimodal applications beyond person identification.
Pages: 13
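The record does not include reference code, but the architecture the abstract describes (per-modality encoders whose embeddings are fused by self-attention before classification) can be illustrated with a minimal PyTorch sketch. All names and dimensions below are assumptions for illustration: the AttentionFusion module, the 256-dimensional embeddings, the 100-identity output head, and the random tensors standing in for the ResNet depth-video encoder and the TDNN voice encoder are hypothetical, not the authors' published implementation.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuses two per-modality embeddings with multi-head self-attention.

    Each modality contributes one token; attention learns how much weight
    to give each modality before the pooled result is classified.
    """
    def __init__(self, embed_dim: int = 256, num_heads: int = 4, num_classes: int = 100):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, depth_emb: torch.Tensor, voice_emb: torch.Tensor) -> torch.Tensor:
        # Stack the two embeddings as a length-2 token sequence: (batch, 2, dim).
        tokens = torch.stack([depth_emb, voice_emb], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)  # self-attention across modalities
        fused = self.norm(fused + tokens)             # residual connection + layer norm
        pooled = fused.mean(dim=1)                    # average the two modality tokens
        return self.classifier(pooled)

# Stand-ins for the real encoders: in the paper's setting, a residual network
# over the lip depth video and a TDNN over an ultrashort utterance would each
# produce one embedding per sample.
depth_emb = torch.randn(8, 256)
voice_emb = torch.randn(8, 256)
logits = AttentionFusion()(depth_emb, voice_emb)
print(logits.shape)  # torch.Size([8, 100])
```

Treating each modality as a token in a short attention sequence is one plausible reading of the abstract's fusion step: it lets the model down-weight whichever embedding is less reliable for a given sample, which is consistent with the noise-robustness claim, though the authors' exact fusion design may differ.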