AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation

Authors
Choi, Jeongsoo [1]
Park, Se Jin [1]
Kim, Minsu [1]
Ro, Yong Man [1]
Affiliations
[1] Korea Adv Inst Sci & Technol, Sch Elect Engn, Daejeon, South Korea
Funding
National Research Foundation, Singapore
DOI
10.1109/CVPR52733.2024.02580
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech Translation (AV2AV) framework, in which both the input and the output of the system are multimodal (i.e., audio and visual speech). The proposed AV2AV brings two key advantages. 1) It enables lifelike conversations with individuals worldwide in a virtual meeting, with each participant using their own primary language. In contrast to Speech-to-Speech Translation (A2A), which translates only between audio modalities, AV2AV translates directly between audio-visual speech. This capability enhances the dialogue experience by presenting synchronized lip movements along with the translated speech. 2) It improves the robustness of the spoken language translation system. By exploiting the complementary information in audio-visual speech, the system can translate spoken language effectively even in the presence of acoustic noise. Because no parallel AV2AV translation dataset exists, we propose to train the spoken language translation system on the audio-only data of A2A. This is made possible by learning unified audio-visual speech representations through self-supervised learning before training the translation system. Moreover, we propose an AV-Renderer that generates raw audio and video in parallel. It is designed with zero-shot speaker modeling, so that the speaker in the source audio-visual speech is maintained in the translated audio-visual speech. The effectiveness of AV2AV is evaluated with extensive experiments in a many-to-many language translation setting. A demo page is available at choijeongsoo.github.io/av2av.
Pages: 27315-27327
Number of pages: 13