Towards Generating Diverse Audio Captions via Adversarial Training

Cited by: 0
Authors
Mei, Xinhao [1 ]
Liu, Xubo [1 ]
Sun, Jianyuan [1 ]
Plumbley, Mark D. [1 ]
Wang, Wenwu [1 ]
Affiliations
[1] Univ Surrey, Ctr Vis Speech & Signal Proc, Guildford GU2 7XH, England
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
Generators; Training; Measurement; Task analysis; Hybrid power systems; Semantics; Maximum likelihood estimation; Audio captioning; GANs; deep learning; cross-modal task; reinforcement learning;
DOI
10.1109/TASLP.2024.3416686
CLC Number
O42 [Acoustics];
Subject Classification Codes
070206; 082403
Abstract
Automated audio captioning is a cross-modal translation task that describes the content of audio clips with natural language sentences. The task has attracted increasing attention, and substantial progress has been made in recent years. Captions generated by existing models are generally faithful to the content of the audio clips; however, these machine-generated captions are often deterministic (e.g., generating a fixed caption for a given audio clip), simple (e.g., using common words and simple grammar), and generic (e.g., generating the same caption for similar audio clips). When people are asked to describe the content of an audio clip, different people tend to focus on different sound events and describe the clip from various aspects, using distinct words and grammar. We believe that an audio captioning system should be able to generate diverse captions, either for a fixed audio clip or across similar audio clips. To this end, we propose an adversarial training framework based on a conditional generative adversarial network (C-GAN) to improve the diversity of audio captioning systems. A caption generator and two hybrid discriminators compete and are learned jointly: the caption generator can be any standard encoder-decoder captioning model, and the hybrid discriminators assess the generated captions against different criteria, such as their naturalness and semantics. We conduct experiments on the Clotho dataset. The results show that our proposed model can generate captions with better diversity compared to state-of-the-art methods.
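To illustrate the idea described in the abstract, the following is a minimal, hypothetical sketch of how two discriminator criteria (naturalness and audio-caption semantic match) could be combined into a single reward for a caption generator. In the paper the discriminators are learned neural networks; here the scoring functions `naturalness_score` and `semantic_score` are toy stand-ins invented purely for illustration, not the authors' method.

```python
def naturalness_score(caption):
    # Hypothetical stand-in for the naturalness discriminator:
    # rewards captions with less word repetition.
    words = caption.split()
    if not words:
        return 0.0
    return len(set(words)) / len(words)


def semantic_score(caption, audio_tags):
    # Hypothetical stand-in for the semantic discriminator:
    # rewards overlap between caption words and the clip's sound-event tags.
    if not audio_tags:
        return 0.0
    return len(set(caption.split()) & set(audio_tags)) / len(set(audio_tags))


def hybrid_reward(caption, audio_tags, alpha=0.5):
    # Combine both discriminator scores into one scalar reward that a
    # policy-gradient (reinforcement learning) update could pass back
    # to the caption generator, as in GAN-based text generation.
    return alpha * naturalness_score(caption) + \
        (1 - alpha) * semantic_score(caption, audio_tags)
```

A fluent caption that also mentions the tagged sound events would receive a high reward, while a repetitive or off-topic caption would score low, which is the pressure that drives the generator toward diverse yet faithful captions.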
Pages: 3311-3323 (13 pages)
Related Papers
(50 total)
  • [1] Towards Generating Stylized Image Captions via Adversarial Training
    Nezami, Omid Mohamad
    Dras, Mark
    Wan, Stephen
    Paris, Cecile
    Hamey, Len
    PRICAI 2019: TRENDS IN ARTIFICIAL INTELLIGENCE, PT I, 2019, 11670: 270-284
  • [2] DIVERSE AUDIO CAPTIONING VIA ADVERSARIAL TRAINING
    Mei, Xinhao
    Liu, Xubo
    Sun, Jianyuan
    Plumbley, Mark D.
    Wang, Wenwu
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022: 8882-8886
  • [3] Generating Accurate and Diverse Audio Captions Through Variational Autoencoder Framework
    Zhang, Yiming
    Du, Ruoyi
    Tan, Zheng-Hua
    Wang, Wenwu
    Ma, Zhanyu
    IEEE SIGNAL PROCESSING LETTERS, 2024, 31: 2520-2524
  • [4] MusCaps: Generating Captions for Music Audio
    Manco, Ilaria
    Benetos, Emmanouil
    Quinton, Elio
    Fazekas, Gyorgy
    2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021
  • [5] Generating steganographic images via adversarial training
    Hayes, Jamie
    Danezis, George
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 30 (NIPS 2017), 2017, 30
  • [6] Towards the transferable audio adversarial attack via ensemble methods
    Guo, Feng
    Sun, Zheng
    Chen, Yuxuan
    Ju, Lei
    CYBERSECURITY, 2023, 6 (01)
  • [8] Generating Informative and Diverse Conversational Responses via Adversarial Information Maximization
    Zhang, Yizhe
    Galley, Michel
    Gao, Jianfeng
    Gan, Zhe
    Li, Xiujun
    Brockett, Chris
    Dolan, Bill
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018), 2018, 31
  • [9] Generating Diverse and Descriptive Image Captions Using Visual Paraphrases
    Liu, Lixin
    Tang, Jiajun
    Wan, Xiaojun
    Guo, Zongming
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019: 4239-4248
  • [10] Towards Generating and Evaluating Iconographic Image Captions of Artworks
    Cetinic, Eva
    JOURNAL OF IMAGING, 2021, 7 (08)