AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation

被引:0
|
作者
Choi, Jeongsoo [1 ]
Park, Se Jin [1 ]
Kim, Minsu [1 ]
Ro, Yong Man [1 ]
机构
[1] Korea Adv Inst Sci & Technol, Sch Elect Engn, Daejeon, South Korea
基金
新加坡国家研究基金会;
关键词
D O I
10.1109/CVPR52733.2024.02580
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech Translation (AV2AV) framework, where the input and output of the system are multimodal ( i.e., audio and visual speech). With the proposed AV2AV, two key advantages can be brought: 1) We can perform real-like conversations with individuals worldwide in a virtual meeting by utilizing our own primary languages. In contrast to Speech-to-Speech Translation (A2A), which solely translates between audio modalities, the proposed AV2AV directly translates between audio-visual speech. This capability enhances the dialogue experience by presenting synchronized lip movements along with the translated speech. 2) We can improve the robustness of the spoken language translation system. By employing the complementary information of audio-visual speech, the system can effectively translate spoken language even in the presence of acoustic noise, showcasing robust performance. To mitigate the problem of the absence of a parallel AV2AV translation dataset, we propose to train our spoken language translation system with the audio-only dataset of A2A. This is done by learning unified audio-visual speech representations through self-supervised learning in advance to train the translation system. Moreover, we propose an AV-Renderer that can generate raw audio and video in parallel. It is designed with zero-shot speaker modeling, thus the speaker in source audio-visual speech can be maintained at the target translated audio-visual speech. The effectiveness of AV2AV is evaluated with extensive experiments in a many-to-many language translation setting. Demo page is available on choijeongsoo.github.io/av2av.
引用
收藏
页码:27315 / 27327
页数:13
相关论文
共 50 条
  • [1] AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation
    Huang, Rongjie
    Liu, Huadai
    Cheng, Xize
    Ren, Yi
    Li, Linjun
    Ye, Zhenhui
    He, Jinzheng
    Zhang, Lichao
    Liu, Jinglin
    Yin, Xiang
    Zhao, Zhou
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023, : 8590 - 8604
  • [2] An audio-visual distance for audio-visual speech vector quantization
    Girin, L
    Foucher, E
    Feng, G
    1998 IEEE SECOND WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING, 1998, : 523 - 528
  • [3] Audio-visual speech experience with age influences perceived audio-visual asynchrony in speech
    Alm, M. (magnus.alm@svt.ntnu.no), 1600, Acoustical Society of America (134):
  • [4] Audio-visual speech experience with age influences perceived audio-visual asynchrony in speech
    Alm, Magnus
    Behne, Dawn
    JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2013, 134 (04): : 3001 - 3010
  • [5] An audio-visual speech recognition with a new mandarin audio-visual database
    Liao, Wen-Yuan
    Pao, Tsang-Long
    Chen, Yu-Te
    Chang, Tsun-Wei
    INT CONF ON CYBERNETICS AND INFORMATION TECHNOLOGIES, SYSTEMS AND APPLICATIONS/INT CONF ON COMPUTING, COMMUNICATIONS AND CONTROL TECHNOLOGIES, VOL 1, 2007, : 19 - +
  • [6] Expressive audio-visual speech
    Bevacqua, E
    Pelachaud, C
    COMPUTER ANIMATION AND VIRTUAL WORLDS, 2004, 15 (3-4) : 297 - 304
  • [7] Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis
    Yang, Karren
    Markovic, Dejan
    Krenn, Steven
    Agrawal, Vasu
    Richard, Alexander
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 8217 - 8227
  • [8] Effects of aging on audio-visual speech integration Effects of aging on audio-visual speech integration
    Huyse, Aurelie
    Leybaert, Jacqueline
    Berthommier, Frederic
    JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2014, 136 (04): : 1918 - 1931
  • [9] Audio-visual speech recognition based on joint training with audio-visual speech enhancement for robust speech recognition
    Hwang, Jung-Wook
    Park, Jeongkyun
    Park, Rae-Hong
    Park, Hyung-Min
    APPLIED ACOUSTICS, 2023, 211
  • [10] An audio-visual speech recognition system for testing new audio-visual databases
    Pao, Tsang-Long
    Liao, Wen-Yuan
    VISAPP 2006: PROCEEDINGS OF THE FIRST INTERNATIONAL CONFERENCE ON COMPUTER VISION THEORY AND APPLICATIONS, VOL 2, 2006, : 192 - +