TAVT: Towards Transferable Audio-Visual Text Generation

Cited by: 0
Authors
Lin, Wang [1]
Jin, Tao [1]
Wang, Ye [1]
Pan, Wenwen [1]
Li, Linjun [1]
Cheng, Xize [1]
Zhao, Zhou [1]
Affiliations
[1] Zhejiang Univ, Hangzhou, Peoples R China
Funding
National Key Research and Development Program of China; National Natural Science Foundation of China
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Audio-visual text generation aims to understand multi-modal content and translate it into text. Although various transfer learning techniques for text generation have been proposed, they focus on uni-modal analysis (e.g., text-to-text, visual-to-text) and do not consider multi-modal content or cross-modal relations. Motivated by the fact that humans can recognize the timbre of the same low-level concepts (e.g., footsteps, rainfall, and laughter) even under different visual conditions, we aim to mitigate domain discrepancies through audio-visual correlation. In this paper, we propose a novel Transferable Audio-Visual Text Generation framework, named TAVT, which consists of two key components: the Audio-Visual Meta-Mapper (AVMM) and Dual Counterfactual Contrastive Learning (DCCL). (1) AVMM first introduces a universal auditory semantic space and drifts the domain-invariant low-level concepts into visual prefixes. Reconstruction-based learning then encourages the AVMM to learn "which pixels belong to the same sound" and to produce audio-enhanced visual prefixes. The well-trained AVMM can also be applied to uni-modal settings. (2) DCCL leverages destructive counterfactual transformations to provide cross-modal constraints for AVMM from the perspectives of feature distribution and text generation. Experimental results show that TAVT outperforms state-of-the-art methods across multiple domains (cross-dataset, cross-category) and various modal settings (uni-modal, multi-modal).
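The abstract does not give the form of the DCCL objective, so the following is only a minimal sketch of the general idea of contrasting a matched audio-visual pair against negatives produced by a destructive transformation. It uses a generic InfoNCE-style loss, not the authors' formulation. The function names (`cosine`, `counterfactual`, `contrastive_loss`), the use of dimension shuffling as the destructive transformation, and the temperature value are all assumptions made for illustration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def counterfactual(feature, rng):
    """Hypothetical destructive transformation: shuffle feature dimensions.

    This is an assumption for illustration; the source abstract does not
    specify which transformations DCCL uses.
    """
    shuffled = list(feature)
    rng.shuffle(shuffled)
    return shuffled


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE-style loss: small when the anchor is closer to the
    positive than to any negative."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


if __name__ == "__main__":
    rng = random.Random(0)
    audio = [1.0, 0.0, 0.0, 0.0]    # toy audio feature (anchor)
    visual = [0.9, 0.1, 0.0, 0.0]   # toy matched visual feature (positive)
    # Negatives: destructively transformed copies of the matched visual feature.
    negatives = [counterfactual(visual, rng) for _ in range(4)]
    print(contrastive_loss(audio, visual, negatives))
```

In this toy setting the loss is lower when the positive is aligned with the anchor than when it is orthogonal to it, which is the behaviour a contrastive constraint of this kind is meant to reward.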
Pages: 14983-14999
Number of pages: 17