AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation

Cited by: 0
Authors: Choi, Jeongsoo [1]; Park, Se Jin [1]; Kim, Minsu [1]; Ro, Yong Man [1]
Affiliations: [1] Korea Advanced Institute of Science and Technology (KAIST), School of Electrical Engineering, Daejeon, South Korea
Funding: National Research Foundation of Singapore
DOI: 10.1109/CVPR52733.2024.02580
Chinese Library Classification (CLC): TP18 (Artificial Intelligence Theory)
Discipline codes: 081104; 0812; 0835; 1405
Abstract
This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech Translation (AV2AV) framework, where both the input and the output of the system are multimodal (i.e., audio and visual speech). The proposed AV2AV brings two key advantages: 1) We can hold face-to-face-style conversations with individuals worldwide in a virtual meeting, each using our own primary language. In contrast to Speech-to-Speech Translation (A2A), which translates only between audio modalities, the proposed AV2AV translates directly between audio-visual speech. This capability enhances the dialogue experience by presenting lip movements synchronized with the translated speech. 2) We can improve the robustness of the spoken language translation system. By exploiting the complementary information in audio-visual speech, the system translates spoken language effectively even in the presence of acoustic noise. To mitigate the absence of a parallel AV2AV translation dataset, we propose to train our spoken language translation system on the audio-only datasets of A2A. This is made possible by first learning unified audio-visual speech representations through self-supervised learning, before training the translation system. Moreover, we propose an AV-Renderer that generates raw audio and video in parallel. It is designed with zero-shot speaker modeling, so the speaker of the source audio-visual speech is preserved in the translated audio-visual speech. The effectiveness of AV2AV is evaluated with extensive experiments in a many-to-many language translation setting. A demo page is available at choijeongsoo.github.io/av2av.
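The abstract describes a three-stage pipeline: extract discrete, modality-agnostic speech units from audio-visual input, translate at the unit level (trainable on audio-only A2A data), and render synchronized audio and video with zero-shot speaker conditioning. The sketch below illustrates only this data flow; every function name, the integer-unit encoding, and the toy computations are assumptions for illustration, not the authors' actual implementation.

```python
# Conceptual sketch of the AV2AV data flow described in the abstract.
# All names, shapes, and the discrete-unit encoding are illustrative
# assumptions, not the authors' code.
from dataclasses import dataclass
from typing import List


@dataclass
class AVSpeech:
    audio: List[float]  # raw waveform samples (placeholder)
    video: List[int]    # lip-region frame ids (placeholder)


def extract_av_units(av: AVSpeech) -> List[int]:
    """Stage 1 (assumed): a self-supervised encoder maps audio-visual
    speech to discrete, modality-agnostic units. Faked here with a
    trivial per-frame hash of the paired inputs."""
    return [(int(a * 100) + v) % 200 for a, v in zip(av.audio, av.video)]


def translate_units(units: List[int], tgt_lang: str) -> List[int]:
    """Stage 2 (assumed): a unit-to-unit translation model; because the
    units are modality-agnostic, it can be trained on audio-only (A2A)
    data. Faked here as a per-language offset."""
    offset = {"en": 0, "es": 1, "fr": 2}[tgt_lang]
    return [(u + offset) % 200 for u in units]


def av_render(units: List[int], speaker_emb: List[float]) -> AVSpeech:
    """Stage 3 (assumed): the AV-Renderer generates raw audio and video
    in parallel from target units, conditioned on a zero-shot speaker
    embedding so the source speaker's identity is kept."""
    audio = [u / 200.0 + speaker_emb[0] for u in units]
    video = [u % 25 for u in units]  # fake lip frames
    return AVSpeech(audio=audio, video=video)


def av2av(src: AVSpeech, tgt_lang: str, speaker_emb: List[float]) -> AVSpeech:
    units = extract_av_units(src)
    return av_render(translate_units(units, tgt_lang), speaker_emb)


src = AVSpeech(audio=[0.1, 0.2, 0.3], video=[5, 6, 7])
out = av2av(src, "es", speaker_emb=[0.05])
print(len(out.audio), len(out.video))
```

The key design point the abstract emphasizes is that translation happens entirely in the shared unit space, which is why audio-only parallel corpora suffice for training the translation stage.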
Pages: 27315-27327 (13 pages)