TAVT: Towards Transferable Audio-Visual Text Generation

Cited by: 0

Authors:

Lin, Wang [1]

Jin, Tao [1]

Wang, Ye [1]

Pan, Wenwen [1]

Li, Linjun [1]

Cheng, Xize [1]

Zhao, Zhou [1]

Affiliations:

[1] Zhejiang Univ, Hangzhou, Peoples R China

Source:

PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1 | 2023

Funding:

National Key R&D Program of China; National Natural Science Foundation of China;

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Audio-visual text generation aims to understand multi-modal content and translate it into text. Although various transfer learning techniques for text generation have been proposed, they focus on uni-modal analysis (e.g., text-to-text, visual-to-text) and do not consider multi-modal content or cross-modal relations. Motivated by the fact that humans can recognize the timbre of the same low-level concepts (e.g., footsteps, rainfall, and laughter), even under different visual conditions, we aim to mitigate domain discrepancies through audio-visual correlation. In this paper, we propose a novel Transferable Audio-Visual Text Generation framework, named TAVT, which consists of two key components: the Audio-Visual Meta-Mapper (AVMM) and Dual Counterfactual Contrastive Learning (DCCL). (1) AVMM first introduces a universal auditory semantic space and drifts the domain-invariant low-level concepts into visual prefixes. Reconstruction-based learning then encourages AVMM to learn "which pixels belong to the same sound" and to produce audio-enhanced visual prefixes. The well-trained AVMM can also be applied to uni-modal settings. (2) DCCL leverages destructive counterfactual transformations to provide cross-modal constraints for AVMM from the perspectives of feature distribution and text generation. (3) Experimental results show that TAVT outperforms state-of-the-art methods across multiple domains (cross-dataset, cross-category) and modal settings (uni-modal, multi-modal).


Pages: 14983-14999

Page count: 17

