Audio-Visual Speaker Verification via Joint Cross-Attention

Cited by: 0
Authors
Rajasekhar, Gnana Praveen [1]
Alam, Jahangir [1]
Affiliations
[1] Comp Res Inst Montreal, Montreal, PQ H3N 1M3, Canada
Keywords
Cross-attention; Audio-visual fusion; Speaker verification; Joint-attention;
DOI
10.1007/978-3-031-48312-7_2
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification code
070206; 082403;
Abstract
Speaker verification has been widely explored using speech signals, and deep models have brought significant improvements. Recently, there has been a surge of interest in combining faces and voices, as together they can offer more complementary and comprehensive information than the speech modality alone. Although current methods for fusing faces and voices have shown improvement over the individual face or voice modalities, the potential of audio-visual fusion has not been fully explored for speaker verification. Most existing audio-visual fusion methods rely on either score-level fusion or simple feature concatenation. In this work, we explore cross-modal joint attention to fully leverage the inter-modal complementary information and the intra-modal information for speaker verification. Specifically, we estimate the cross-attention weights based on the correlation between the joint feature representation and the individual feature representations in order to effectively capture both intra-modal and inter-modal relationships among the faces and voices. We show that efficiently leveraging the intra- and inter-modal relationships significantly improves the performance of audio-visual fusion for speaker verification. The performance of the proposed approach has been evaluated on the VoxCeleb1 dataset. Results show that the proposed approach can significantly outperform state-of-the-art methods of audio-visual fusion for speaker verification.
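The abstract describes the attention step only at a high level, so the following is a minimal NumPy sketch of one way such joint cross-attention could be computed, not the authors' implementation. The function name `joint_cross_attention`, the projection matrices `w_ja` and `w_jv`, the `tanh` squashing, the scaling by the square root of the joint dimension, the softmax normalization, and the residual connection are all assumptions made for illustration. The joint representation is taken to be the per-frame concatenation of audio and visual features, and each modality's attention weights come from its correlation with that joint representation.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def joint_cross_attention(x_a, x_v, w_ja, w_jv):
    """Hypothetical joint cross-attention over T aligned frames.

    x_a: (T, d_a) audio features, x_v: (T, d_v) visual features.
    w_ja: (d_a, d_a + d_v) and w_jv: (d_v, d_a + d_v) are assumed
    learnable projections (random here, for illustration only).
    """
    # Joint representation: concatenate the modalities frame by frame.
    j = np.concatenate([x_a, x_v], axis=1)            # (T, d_a + d_v)
    scale = np.sqrt(j.shape[1])

    # Correlation between each modality and the joint representation.
    c_a = np.tanh(x_a @ w_ja @ j.T / scale)           # (T, T)
    c_v = np.tanh(x_v @ w_jv @ j.T / scale)           # (T, T)

    # Turn correlations into attention weights (each row sums to 1).
    weights_a = softmax(c_a, axis=1)
    weights_v = softmax(c_v, axis=1)

    # Re-weight each modality and add a residual connection (assumed).
    att_a = weights_a @ x_a + x_a                     # (T, d_a)
    att_v = weights_v @ x_v + x_v                     # (T, d_v)

    fused = np.concatenate([att_a, att_v], axis=1)    # (T, d_a + d_v)
    return fused, weights_a, weights_v


# Toy inputs: 5 frames, 8-dim audio and 6-dim visual features.
rng = np.random.default_rng(0)
T, d_a, d_v = 5, 8, 6
x_a = rng.standard_normal((T, d_a))
x_v = rng.standard_normal((T, d_v))
w_ja = 0.1 * rng.standard_normal((d_a, d_a + d_v))
w_jv = 0.1 * rng.standard_normal((d_v, d_a + d_v))

fused, weights_a, weights_v = joint_cross_attention(x_a, x_v, w_ja, w_jv)
embedding = fused.mean(axis=0)  # temporal average as an utterance-level embedding
```

Because the joint representation contains both modalities, each correlation matrix mixes a within-modality term and a cross-modality term, which is one reading of how the intra- and inter-modal relationships mentioned in the abstract can be captured in a single attention map. In a verification setting, the pooled `embedding` of two recordings would typically be compared with a similarity score; that step is outside this sketch.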
Pages: 18-31
Page count: 14
Related papers
50 in total
  • [31] Improved audio-visual speaker recognition via the use of a hybrid combination strategy
    Lucey, S
    Chen, TH
    [J]. AUDIO-AND VIDEO-BASED BIOMETRIC PERSON AUTHENTICATION, PROCEEDINGS, 2003, 2688 : 929 - 936
  • [32] SPOOFING DETECTION VIA SIMULTANEOUS VERIFICATION OF AUDIO-VISUAL SYNCHRONICITY AND TRANSCRIPTION
    Schoenherr, Lea
    Zeiler, Steffen
    Kolossa, Dorothea
    [J]. 2017 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2017, : 591 - 598
  • [33] Temporal Cross-Modal Attention for Audio-Visual Event Localization
    Nagasaki Y.
    Hayashi M.
    Kaneko N.
    Aoki Y.
    [J]. Seimitsu Kogaku Kaishi/Journal of the Japan Society for Precision Engineering, 2022, 88 (03): : 263 - 268
  • [34] Audio-Visual Saliency Network with Audio Attention Module
    Cheng, Shuaiyang
    Gao, Xing
    Song, Liang
    Xiahou, Jianbing
    [J]. PROCEEDINGS OF 2021 2ND INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND INFORMATION SYSTEMS (ICAIIS '21), 2021,
  • [35] A Visual Signal Reliability for Robust Audio-Visual Speaker Identification
    Tariquzzaman, Md.
    Kim, Jin Young
    Na, Seung You
    Kim, Hyoung-Gook
    Har, Dongsoo
    [J]. IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2011, E94D (10): : 2052 - 2055
  • [36] Neural Speaker Extraction with Speaker-Speech Cross-Attention Network
    Wang, Wupeng
    Xu, Chenglin
    Ge, Meng
    Li, Haizhou
    [J]. INTERSPEECH 2021, 2021, : 3535 - 3539
  • [37] The 'Audio-Visual Face Cover Corpus': Investigations into audio-visual speech and speaker recognition when the speaker's face is occluded by facewear
    Fecher, Natalie
    [J]. 13TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2012 (INTERSPEECH 2012), VOLS 1-3, 2012, : 2247 - 2250
  • [38] Joint watermarking of audio-visual data
    Dittmann, J
    Steinebach, M
    [J]. 2001 IEEE FOURTH WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING, 2001, : 601 - 606
  • [39] AVA ACTIVE SPEAKER: AN AUDIO-VISUAL DATASET FOR ACTIVE SPEAKER DETECTION
    Roth, Joseph
    Chaudhuri, Sourish
    Klejch, Ondrej
    Marvin, Radhika
    Gallagher, Andrew
    Kaver, Liat
    Ramaswamy, Sharadh
    Stopczynski, Arkadiusz
    Schmid, Cordelia
    Xi, Zhonghua
    Pantofaru, Caroline
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 4492 - 4496
  • [40] Joint Audio-Visual Deepfake Detection
    Zhou, Yipin
    Lim, Ser-Nam
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 14780 - 14789