VocaLiST: An Audio-Visual Synchronisation Model for Lips and Voices

Cited by: 2
Authors
Kadandale, Venkatesh S. [1]
Montesinos, Juan F. [1]
Haro, Gloria [1]
Affiliations
[1] Univ Pompeu Fabra, Dept Informat & Commun Technol, Barcelona, Spain
Source
Funding
European Union Horizon 2020;
Keywords
audio-visual; speech; singing voice; synchronisation; source separation; self-supervision; cross-modal;
DOI
10.21437/Interspeech.2022-10861
Chinese Library Classification number
O42 [Acoustics];
Discipline classification codes
070206; 082403;
Abstract
In this paper, we address the problem of lip-voice synchronisation in videos containing a human face and voice. Our approach determines whether the lip motion and the voice in a video are synchronised, based on their audio-visual correspondence score. We propose an audio-visual cross-modal transformer-based model that outperforms several baseline models in the audio-visual synchronisation task on the standard lip-reading speech benchmark dataset LRS2. While existing methods focus mainly on lip synchronisation in speech videos, we also consider the special case of the singing voice, which is a more challenging use case for synchronisation due to sustained vowel sounds. We also investigate the relevance of lip synchronisation models trained on speech datasets in the context of singing voice. Finally, we use the frozen visual features learned by our lip synchronisation model in the singing voice separation task to outperform a baseline audio-visual model that was trained end-to-end. The demos, source code, and pre-trained models are available at https://ipcv.github.io/VocaLiST/
Pages: 3128-3132
Number of pages: 5
Related papers
50 items in total
  • [1] Audio-Visual Synchronisation for Speaker Diarisation
    Garau, Giulia
    Dielmann, Alfred
    Bourlard, Herve
    [J]. 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, 2010, : 2662 - +
  • [2] Audio-Visual Synchronisation in Quantum Movies
    Yan, Fei
    Iliyasu, Abdullah M.
    Jiao, Sihao
    Yang, Huamin
    [J]. 2018 IEEE 5TH INTERNATIONAL CONGRESS ON INFORMATION SCIENCE AND TECHNOLOGY (IEEE CIST'18), 2018, : 274 - 278
  • [3] Lips Detection for Audio-Visual Speech Recognition System
    Chin, Siew Wen
    Ang, Li-Minn
    Seng, Kah Phooi
    [J]. 2008 INTERNATIONAL SYMPOSIUM ON INTELLIGENT SIGNAL PROCESSING AND COMMUNICATIONS SYSTEMS (ISPACS 2008), 2008, : 311 - 314
  • [4] Visualized voices: A case study of audio-visual synesthesia
    Fernay, Louise
    Reby, David
    Ward, Jamie
    [J]. NEUROCASE, 2012, 18 (01) : 50 - 56
  • [5] Restoring Speaking Lips from Occlusion for Audio-Visual Speech Recognition
    Wang, Jiadong
    Pan, Zexu
    Zhang, Malu
    Tan, Robby T.
    Li, Haizhou
    [J]. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 17, 2024, : 19144 - 19152
  • [6] My lips are concealed: Audio-visual speech enhancement through obstructions
    Afouras, Triantafyllos
    Chung, Joon Son
    Zisserman, Andrew
    [J]. INTERSPEECH 2019, 2019, : 4295 - 4299
  • [7] An Objective Model for Audio-Visual Quality
    Martinez, Helard Becerra
    Farias, Mylene C. Q.
    [J]. IMAGE QUALITY AND SYSTEM PERFORMANCE XI, 2014, 9016
  • [8] PERFECT MATCH: IMPROVED CROSS-MODAL EMBEDDINGS FOR AUDIO-VISUAL SYNCHRONISATION
    Chung, Soo-Whan
    Chung, Joon Son
    Kang, Hong-Goo
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 3965 - 3969
  • [9] An audio-visual distance for audio-visual speech vector quantization
    Girin, L
    Foucher, E
    Feng, G
    [J]. 1998 IEEE SECOND WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING, 1998, : 523 - 528
  • [10] Catching audio-visual mice: The extrapolation of audio-visual speed
    Hofbauer, MM
    Wuerger, SM
    Meyer, GF
    Röhrbein, F
    Schill, K
    Zetzsche, C
    [J]. PERCEPTION, 2003, 32 : 96 - 96