Towards Generating Diverse Audio Captions via Adversarial Training

Cited: 0
Authors
Mei, Xinhao [1 ]
Liu, Xubo [1 ]
Sun, Jianyuan [1 ]
Plumbley, Mark D. [1 ]
Wang, Wenwu [1 ]
Affiliations
[1] Univ Surrey, Ctr Vis Speech & Signal Proc, Guildford GU2 7XH, England
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
Generators; Training; Measurement; Task analysis; Hybrid power systems; Semantics; Maximum likelihood estimation; Audio captioning; GANs; deep learning; cross-modal task; reinforcement learning;
DOI
10.1109/TASLP.2024.3416686
CLC number
O42 [Acoustics];
Discipline codes
070206 ; 082403 ;
Abstract
Automated audio captioning is a cross-modal translation task for describing the content of audio clips with natural language sentences. This task has attracted increasing attention, and substantial progress has been made in recent years. Captions generated by existing models are generally faithful to the content of audio clips; however, these machine-generated captions are often deterministic (e.g., a fixed caption is generated for a given audio clip), simple (e.g., using common words and simple grammar), and generic (e.g., the same caption is generated for similar audio clips). When asked to describe the content of an audio clip, different people tend to focus on different sound events and describe the clip from various aspects using distinct words and grammar. We believe that an audio captioning system should have the ability to generate diverse captions, either for a fixed audio clip or across similar audio clips. To this end, we propose an adversarial training framework based on a conditional generative adversarial network (C-GAN) to improve the diversity of audio captioning systems. A caption generator and two hybrid discriminators compete and are learned jointly: the caption generator can be any standard encoder-decoder captioning model used to generate captions, while the hybrid discriminators assess the generated captions against different criteria, such as their naturalness and semantics. We conduct experiments on the Clotho dataset. The results show that our proposed model can generate captions with better diversity as compared to state-of-the-art methods.
Pages: 3311-3323
Page count: 13
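The adversarial setup described in the abstract can be illustrated with a minimal sketch: two discriminators score a sampled caption (one for naturalness, one for semantic match with the audio), their scores are combined into a single reward, and the generator is updated with a REINFORCE-style policy gradient, as is typical when training sequence GANs over discrete text. The function names, the equal weighting, and the baseline are assumptions for illustration, not the paper's exact formulation.

```python
def hybrid_reward(naturalness_score, semantic_score, weight=0.5):
    """Combine the two discriminators' scores for a sampled caption.

    naturalness_score: estimated probability the caption is human-written.
    semantic_score: estimated probability the caption matches the audio clip.
    weight: assumed mixing coefficient between the two criteria.
    """
    return weight * naturalness_score + (1.0 - weight) * semantic_score


def reinforce_loss(log_probs, reward, baseline=0.0):
    """REINFORCE-style loss: scale the caption's total log-likelihood by the
    baseline-subtracted reward, so high-reward captions are reinforced and
    low-reward ones suppressed when the loss is minimized."""
    return -(reward - baseline) * sum(log_probs)


# Toy usage: a 3-word caption sampled from the generator.
log_probs = [-0.5, -1.2, -0.8]       # log p(word_t | audio, words_<t)
r = hybrid_reward(0.9, 0.7)          # discriminators score the sampled caption
loss = reinforce_loss(log_probs, r, baseline=0.5)
```

The discrete sampling step is what makes plain backpropagation through the discriminators impossible, hence the policy-gradient estimator; in practice the reward would come from trained discriminator networks rather than the fixed scores used here.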